CHAPTER 1


INTRODUCTION 

Markov and semi-Markov decision processes have been studied extensively
since their initial development in the late 1950s and early 1960s.
They provide the natural framework for the study of a plethora of
problems arising in the areas of queueing, inventory, maintenance and
replacement, etc. Many useful results about Markov and semi-Markov
decision processes are now available under a variety of assumptions.

A common assumption has been that of bounded costs. Although bounded
costs are an appropriate assumption for many problems, there are also
many situations, especially in the context of queueing and inventory,
for which it is not appropriate. Thus, there is a need for developing
a theory for Markov and semi-Markov decision processes with unbounded
costs. Although there have been some efforts in this direction earlier,
stronger results need to be developed. That is the objective of this
report. Specifically, results are obtained for semi-Markov decision
processes both when the costs are discounted and when they are not.
Application to the optimal control of queueing systems is also considered.

The terminology of semi-Markov decision processes is summarized in
Section 1. Section 2 then presents some examples of semi-Markov decision
processes both with and without unbounded costs. Section 3 reviews the
literature on semi-Markov decision processes. An overview of the study
is presented in Section 4.


1. Terminology of Semi-Markov Decision Processes.


The  semi-Markov  decision  process  is  a stochastic  process  which 
requires  certain  decisions  to  be  made  at  certain  points  in  time.  These 
points  in  time  are  the  decision  epochs.  At  each  decision  epoch,  the 
system  under  consideration  is  observed  and  found  to  be  in  a certain  state. 
The  set  of  all  conceivable  states  is  the  state  space.  The  decision 
consists  of  choosing  an  action  from  a set  of  permissible  actions.  This 
set  depends  on  the  state  of  the  system  when  the  decision  has  to  be  made. 
The set of permissible actions for a given state is an action space.
The union of all action spaces is referred to as the action space. Once
an action has been chosen, the probabilistic aspects of the evolution
of the system until the next decision epoch occurs (including the time
elapsed and the state of the system at the next decision epoch) are
completely determined by the state of the system when the action was
chosen and the action itself.

A policy for a semi-Markov decision process is a rule which selects
an action at each decision epoch by considering only the history of the
process up to that point in time. An interesting class of policies is
the  class  of  stationary  policies.  A stationary  policy  selects  the  action 
at  each  decision  epoch  solely  on  the  basis  of  the  state  of  the  system 
at  the  decision  epoch.  A stationary  policy  is  deterministic  if  it 
selects  the  actions  according  to  a fixed  mapping  from  the  state  space 
into  the  action  space;  otherwise  it  is  randomized. 

A part of the process is the costs incurred. The objective is to
minimize these costs. They are, however, incurred in a random fashion
and at different times, so a further specification of the objective is
needed. There are several alternatives. If the time factor is not
important, one may choose to minimize the total expected cost, or if this
is not finite, the long-run expected average cost. If the time factor
is important, one may discount the costs and minimize the total expected
discounted cost.

For our purposes, a semi-Markov decision process is completely
specified by four objects: the state space $S$, the action spaces
$\{A_s\}_{s \in S}$, the law of motion $q$, and the cost function $c$.
Let $A = \bigcup_{s \in S} A_s$, and let $R$ be the set of real numbers.
The law of motion, $q$, is a mapping from $S \times A \times S \times R$
into $R$, and the cost function, $c$, is a mapping from
$S \times A \times R$ into $R$. Consider a decision epoch. Suppose the
state there is $s$ and suppose the action chosen there is $a$. Then, for
$s' \in S$ and $t \in R$, $q(s,a,s',t)$ is the joint probability that the
time until the next decision epoch is less than or equal to $t$ and that
the state at the next decision epoch is $s'$. If the times between the
decision epochs are constant, then we have a Markov decision process.
Also, for $t \in R$, $c(s,a,t)$ is the expected cost accumulated until
time $t$. The formulation of a problem in the framework of semi-Markov
decision processes consists of specifying $S$, $\{A_s\}_{s \in S}$, $q$
and $c$. Some examples of semi-Markov decision processes are now
presented.
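
To make the four objects concrete, the following is a minimal Python
sketch of how a state space, action spaces, law of motion and cost
function might be bundled together. It is only an illustration: the class
name `SMDP`, the functions `law_of_motion` and `cost`, and the toy data
(exponential times between epochs, a constant cost rate while waiting)
are assumptions made here and do not come from the report.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SMDP:
    """Illustrative container for the four objects S, {A_s}, q and c."""
    states: List[int]                            # state space S
    actions: Dict[int, List[str]]                # action space A_s for each s
    q: Callable[[int, str, int, float], float]   # law of motion q(s, a, s', t)
    c: Callable[[int, str, float], float]        # cost function c(s, a, t)


# Toy data (assumed): the time to the next decision epoch is exponential
# with a rate depending on the action; the next state is drawn independently.
RATE = {"slow": 1.0, "fast": 2.0}
COST_RATE = {"slow": 1.0, "fast": 3.0}
NEXT = {0: [0.3, 0.7], 1: [0.6, 0.4]}


def law_of_motion(s: int, a: str, s_next: int, t: float) -> float:
    """P(time to next epoch <= t and next state == s_next | state s, action a)."""
    if t <= 0.0:
        return 0.0
    return NEXT[s][s_next] * (1.0 - math.exp(-RATE[a] * t))


def cost(s: int, a: str, t: float) -> float:
    """Expected cost accumulated up to time t at a constant rate while waiting.

    For an exponential waiting time T with rate r,
    E[min(t, T)] = (1 - exp(-r * t)) / r.
    """
    if t <= 0.0:
        return 0.0
    r = RATE[a]
    return COST_RATE[a] * (1.0 - math.exp(-r * t)) / r


model = SMDP(states=[0, 1],
             actions={0: ["slow", "fast"], 1: ["slow", "fast"]},
             q=law_of_motion, c=cost)
```

Summing `model.q(s, a, s_next, t)` over `s_next` for large `t` recovers a
total probability of one, and `model.c(s, a, t)` for large `t` gives the
expected cost until the next decision epoch.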


2. Examples of Semi-Markov Decision Processes With and Without Unbounded
Costs.

Selling  an  asset  (Ross  (1970)): 

Consider a person who wants to sell his house. Offers arrive according
to a stationary Poisson process. The sizes of the offers are independent,
identically distributed random variables. When an offer arrives, it
must either be accepted or rejected. Rejected offers are lost. A
maintenance cost is incurred at a constant non-negative rate until the
house is sold. The problem is to decide when an offer should be accepted.
This problem can be formulated within the framework of a semi-Markov
decision process as follows.

Let  the  decision  epochs  be  the  same  as  the  epochs  when  offers  arrive, 
let the actions be to accept or reject the current offer, and let the
state  of  the  system  be  the  size  of  the  offer  at  the  most  recent  decision 
epoch. 

A job  shop  model  (Lippman  and  Ross  (1968)): 

Consider a factory which is only able to handle one job at a time.
Jobs arrive according to a stationary Poisson process. When a job arrives
it is classified to be of a certain type. Jobs of the same type have
an identical probabilistic structure for their cost and completion time.
The classifications of arriving jobs are independent, identically
distributed random variables. Each job must either be accepted or
rejected. Jobs arriving when the factory is busy are rejected
automatically. The problem is to determine when a job should be accepted
(rejected) when the factory is not busy. This problem can be formulated
within the framework of semi-Markov decision processes as follows.

Let  the  decision  epochs  be  the  same  as  the  epochs  of  job  arrivals 
(neglect  jobs  which  arrive  when  the  factory  is  busy),  let  the  available 
actions  be  to  accept  or  reject  the  job  that  just  arrived,  and  let  the 
state  of  the  system  be  the  type  of  job  present. 

The M/G/1 queueing system with removable server (Heyman (1968)):

Consider a queueing system having one server which can be turned
on and off. Customers arrive according to a stationary Poisson process.
They are served one by one on a first-come-first-served basis. The service
times are independent, identically distributed random variables. There
is a cost associated with the service of each customer. These costs are
independent, identically distributed random variables. There are fixed
charges for turning the server on and off. There is a cost for having
the server on when there are no customers in the system. This cost is
incurred at a constant rate at such times. Finally, there is a cost
for holding customers in the system. This cost is incurred at a rate
which is a non-negative, non-decreasing function of the number of
customers present. The problem is to determine when the server should be
turned on and turned off. This problem can be formulated within the
framework of semi-Markov decision processes as follows.

Let  the  decision  epochs  be  the  epochs  of  customer  arrivals  and 
departures  (neglect  arrivals  which  occur  when  the  server  is  busy).  Let 
the available actions be to turn the server off (or have him off) and
to turn him on (or have him on). Finally, let the state of the system
be  a vector  whose  first  component  gives  the  number  of  customers  present, 
and  whose  second  component  shows  the  status  of  the  server. 

3. A Brief Survey of the Literature on Semi-Markov Decision Processes.

The first comprehensive study of Markov decision processes was done
by Howard (1960). He assumed finite state and action spaces, and
considered the problem both with and without discounting. He only
considered stationary policies, and developed his now well-known policy
improvement procedures. He proved that they would produce optimal
stationary policies.

At the same time, Manne (1960) suggested solving the Markov decision
problem by using linear programming. He used the average cost criterion,
and showed how to solve an inventory problem by his suggested approach.
The first linear programming formulation for the problem with discounting
was given by d'Epenoux (1960). Shortly afterwards, Wolfe and Dantzig
(1962) proposed the use of their decomposition technique on Manne's
linear programming formulation.

Blackwell (1962) considered Markov decision processes with finite
state and action spaces, and proved that there is a stationary policy
which is optimal among all Markov policies. He also considered the
problem for arbitrarily small interest rates, and proved that there is
a stationary policy which is optimal among all Markov policies for small
enough interest rates. Later, Blackwell (1965) considered Markov decision
processes with more general state and action spaces. He only assumed
that they were Borel sets. However, he assumed that the rewards were
uniformly bounded. He considered the problem with discounting, and
allowed any measurable policy. His main results were the following.
There is a $(p,\varepsilon)$-optimal stationary policy. If the action
spaces are countable, then there is an $\varepsilon$-optimal stationary
policy. If the action spaces are finite, then there is an optimal
stationary policy. If there is an optimal policy, then there is one which
is stationary.

Strauch (1966) considered the same problem as Blackwell, but instead
of  using  discounting,  he  assumed  that  the  rewards  were  negative.  His 
main  results  were  similar  to  those  of  Blackwell.  If  the  action  spaces 
are  finite,  then  there  is  an  optimal  policy.  If  there  is  an  optimal 
policy,  then  there  is  one  which  is  stationary.  The  optimal  return  function 
is  measurable  and  satisfies  the  optimality  equation. 

Denardo (1967) also considered the same problem as Blackwell and
generalized it to include certain stochastic games. He introduced
operators with certain monotonicity and contraction properties, and used
the Picard-Banach fixed point theorem to prove that the functional equation
of optimality has a unique solution, which is the optimal reward function.

Veinott  (1966)  gave  a policy  iteration  procedure  for  finding  a bias- 
optimal  policy  (no  discounting).  Later,  Veinott  (1969)  considered  a 
more  refined  optimality  criterion,  namely,  that  of  finding  a policy  which 
is  optimal  for  all  sufficiently  small  interest  rates  (sensitive  discount- 
optimality).  He  developed  a policy  iteration  procedure  for  finding  a 
stationary  policy  which  would  be  optimal  according  to  this  criterion. 

Derman  (1966)  considered  Markov  decision  processes  with  finite 
action  spaces  and  a countable  state  space.  He  used  the  average  cost 
criterion,  and  gave  conditions  for  when  a stationary,  deterministic 
policy is optimal. Ross (1968) considered the same problem, but allowed
a general state space. He derived results similar to those of Derman.
He also suggested a method for converting the average cost problem to
a discounted cost problem.

One  of  the  first  to  consider  semi-Markov  processes  was  Pyke  (1961). 
Shortly  afterwards,  Howard’s  results  for  Markov  decision  processes  were 
extended  to  semi-Markov  decision  processes  independently  by  Jewell  (1963) 
and Howard (1964). When they considered the average cost criterion,
they assumed that all states belong to one positive recurrent class.
They also gave linear programming formulations.

Denardo and Fox (1968) considered the multi-chain case (i.e., the
case of several positive recurrent classes), using the average cost
criterion. They gave a linear programming formulation and a policy
improvement procedure. Later, Denardo (1970a) developed a solution method
which used Manne's linear programming formulation to solve a sequence of
subproblems. This solution method has the advantage that several small
linear programming problems are solved instead of one big one. Denardo
(1971) also considered the problem when small interest rates are used.
His results are similar to those of Veinott for the discrete-time Markov
decision process. He gave a sequence of linear programming problems
for finding an optimal policy.

All  of  these  authors  have  assumed  that  the  immediate  rewards  or  costs 
are  bounded  uniformly.  After  Strauch,  Harrison  (1972)  was  the  first  one 
to  relax  the  condition  of  bounded  costs.  He  assumed  that  the  expected 
absolute  reward  in  one  period  minus  the  expected  absolute  reward  in  the 
period  before  it,  given  the  state  at  the  beginning  of  that  period,  is 
uniformly  bounded.  He  then  showed  that  the  expected  discounted  reward 
is  finite  for  each  policy  and  that  there  exists  a stationary  policy 
which  is  optimal.  He  proved  this  by  using  the  Picard-Banach  fixed  point 
theorem. He also extended his results from Markov decision processes
to semi-Markov decision processes.

The problem with unbounded costs was also considered by Reed (1973).
He investigated the problem both with and without discounting. He assumed
finite action spaces and a countable state space. He gave sufficient
conditions for a stationary policy to be optimal.

Hordijk  (1974a),  (1974b)  also  considered  the  problem  with  unbounded 
costs.  He  introduced  the  notion  of  convergent  dynamic  programming,  which 
is just to say that the expectation of the sum of the absolute rewards
is  finite.  He  proved  that  a policy  is  optimal  if  it  is  unimprovable  and 
if  another  condition  is  satisfied. 




Most recently, Lippman (1973), (1975a) considered the problem with
unbounded  costs.  His  approach  is  to  use  a norm  such  that  the  norm  of 
the  costs  is  finite  even  though  the  costs  are  unbounded.  In  order  to 
obtain  the  usual  results,  he  then  has  to  make  assumptions  about  the 
law  of  motion  of  the  system.  By  doing  that,  he  showed  that  Denardo's 
N-stage  contraction  assumption  is  satisfied,  and  the  results  follow. 

4. Overview of the Study.

The emphasis of this report is on determining necessary and sufficient
conditions  for  a stationary  policy  to  be  optimal.  It  is  not  assumed  that 
the  costs  are  bounded.  The  problem  is  considered  both  with  and  without 
discounting. 

Chapter 2 treats the problem without discounting. Two closely
related optimality criteria are used, namely, the average cost criterion
and the undiscounted cost criterion. After introducing the important
concept of an unimprovable policy, sufficient conditions are given for an
unimprovable policy to be optimal. Both the special case where the
optimal expected average cost is independent of the start-state and the
general case where the average cost is not necessarily constant are
considered.

Chapter 3 treats the problem with discounting. After formulating
the problem and introducing the associated operators, the optimality
equation is proven. The existence of stationary optimal and stationary
$\varepsilon$-optimal policies is then investigated. Policy improvement
is considered, and some necessary and sufficient conditions for
optimality are given.

Chapter 4 is devoted to the optimal control of queueing systems.
Solution methods are explored, and four different ways of solving the
problem of unbounded costs are presented.

Some general notation and conventions are best introduced here. $R$
denotes the set of real numbers, $R_+$ denotes the set of non-negative
real numbers, $N$ denotes the set of natural numbers (starting with one)
and $N_0$ denotes the non-negative integers. The Kronecker delta function
$\delta$ is defined by

$$\delta(x,y) = \begin{cases} 1 & \text{if } x = y, \\ 0 & \text{if } x \neq y. \end{cases}$$

If $x$ is a real number, then $x^+$ is $\max(0,x)$ and $x^-$ is
$\max(0,-x)$. Finally, we use the convention that

$$x + y = \begin{cases} +\infty & \text{if } x = \infty,\ y > -\infty, \\ -\infty & \text{if } x < \infty,\ y = -\infty \end{cases}$$

(undefined if $x = -y = \pm\infty$).
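
The positive and negative parts and the extended addition convention can
be sketched in a few lines of Python. The function names `pos_part`,
`neg_part` and `ext_add` are illustrative choices made here, and raising
an error for the undefined case is one possible design, not something
prescribed by the report.

```python
import math


def pos_part(x: float) -> float:
    """x^+ = max(0, x)."""
    return max(0.0, x)


def neg_part(x: float) -> float:
    """x^- = max(0, -x)."""
    return max(0.0, -x)


def ext_add(x: float, y: float) -> float:
    """Extended addition; infinities of opposite sign are left undefined."""
    if math.isinf(x) and math.isinf(y) and x != y:
        raise ValueError("x + y is undefined for infinities of opposite sign")
    return x + y
```

For any finite `x`, `pos_part(x) - neg_part(x)` recovers `x`, which is
the decomposition typically used when costs may take either sign.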

CHAPTER 2


SEMI-MARKOV DECISION PROCESSES WITHOUT DISCOUNTING

This  chapter  presents  an  investigation  of  semi-Markov  decision 
processes  without  discounting  the  costs.  Thus,  costs  of  equal  size 
incurred  at  different  times  count  the  same.  Two  optimality  criteria 
are  used.  The  first  one  is  the  average  cost  criterion,  according  to 
which  a policy  is  optimal  if  the  long-run  expected  average  cost  is 
minimized  by  this  policy.  This  criterion  has  been  considered  recently 
by Hordijk (1974a). The other criterion is the undiscounted cost
criterion.  A policy  is  optimal  under  this  criterion  if  it  minimizes 
the  long-run  (total)  expected  cost  for  the  process  which  is  derived  from 
the  original  one  by  incurring  an  additional  cost  at  a rate  equal  to  the 
negative  of  the  minimum  average  cost.  This  criterion  has  been  considered 
by  Denardo  (1970).  He  called  a policy  which  is  optimal  for  this  criterion 
a bias-optimal policy.

There  have  traditionally  been  two  approaches  to  the  problem  without 
discounting.  The  first  one  consists  of  restricting  one’s  consideration 
to  stationary  (deterministic)  policies  and  performing  a stationary 
analysis.  The  second  one  consists  of  considering  the  problem  with  dis- 
counting and  observing  what  happens  when  the  interest  rate  goes  to  zero. 
Here,  we  will  follow  the  first  approach.  It  has  been  common  to  assume 
that  the  costs  are  uniformly  bounded.  We  make  no  assumptions  about  the 
size  of  the  costs.  Reed  (1975)  conducted  a similar  but  somewhat  less 
complete  study  of  the  problem. 


In Section 1, there is a formal statement of the problem to be
considered. It also contains some preliminary results. Unimprovable
policies are defined there. In Section 2, sufficient conditions for an
unimprovable policy to be optimal are given. It is assumed that the
long-run expected average cost is constant. In Section 3, the results
from Section 2 are extended to cover the general case of non-constant
long-run expected average cost. In Section 4, there is a brief discussion
of methods for finding an optimal policy.


1.  Problem  Formulation. 

As before, let $S$ be the state space, $\{A_s\}_{s \in S}$ be the action
spaces, $q$ be the law of motion and $c$ the cost function. Let
$\mathcal{D}$ be the set of stationary, deterministic policies, and let
$A$ be $\bigcup_{s \in S} A_s$. For each $n \in N$, let $t_n$, $s_n$, and
$a_n$ denote the time of the $n$th decision epoch, the state observed
there, and the action chosen there, respectively.

For each $\pi \in \mathcal{D}$, let $v_\pi$ be the mapping from
$S \times R_+$ into $R$ such that, for each $s \in S$ and $t \in R_+$,

$$v_\pi(s,t) = E_{\pi,s}\Big\{ \sum_{n \in N_t} c(s_n, a_n, t - t_n) \Big\},$$

where

$$N_t = \{ n \in N \mid t_n \le t \}.$$

$E$ is the expectation operator, and the subscripts $\pi$ and $s$
respectively denote that the policy used is $\pi$ and that the
start-state is $s$. In words, $v_\pi(s,t)$ is the expected cost incurred
until time $t$, given that the start-state is $s$ and that the policy
$\pi$ is used. $v_\pi$ need not always be well-defined. Later, however,
certain assumptions which guarantee the existence of $v_\pi$ for each
$\pi \in \mathcal{D}$ will be made.

The analysis here is based on the fact that under certain conditions
(to be introduced when needed), $v_\pi(s,t)$ has a linear asymptote in
$t$ for each $s \in S$ and $\pi \in \mathcal{D}$. For each
$\pi \in \mathcal{D}$, let $\varphi_\pi$ and $w_\pi$ be the mappings
from $S$ into $R$ such that

$$\varphi_\pi(s) = \lim_{t \to \infty} v_\pi(s,t)/t,$$

$$w_\pi(s) = \lim_{t \to \infty} \{ v_\pi(s,t) - t \cdot \varphi_\pi(s) \},$$

for $s \in S$. $\varphi_\pi(s)$ is the long-run expected average cost,
given that the start-state is $s$ and that the policy $\pi$ is used.
$w_\pi(s)$ is the long-run expected cost not accounted for by
$\varphi_\pi(s)$.

Two optimality criteria will be used. The first one is the average
cost criterion. A policy $\pi^* \in \mathcal{D}$ is optimal according to
this criterion if $\varphi_{\pi^*}(s) \le \varphi_\pi(s)$ for $s \in S$
and $\pi \in \mathcal{D}$, and the policy is called average optimal. The
second criterion is the undiscounted cost criterion. A policy
$\pi^* \in \mathcal{D}$ is optimal according to this criterion if it is
average optimal and, in addition, $w_{\pi^*}(s) \le w_\pi(s)$ for
$s \in S$ such that $\varphi_{\pi^*}(s) = \varphi_\pi(s)$, for
$\pi \in \mathcal{D}$. A policy which is optimal in this sense is called
undiscounted optimal. This latter criterion has not received much
attention in the literature. This may be due to the fact that often
there is not much to gain by using this criterion instead of the average
cost criterion. The main difference resulting from the use of these
criteria is that the actions in the transient states become more
important when the undiscounted cost criterion is used. To illustrate
this point further, an example is included below.

Example: Consider the following simple semi-Markov decision process.
The state space is $N_0$ and the action spaces are $\{0,1\}$. The times
between the decision epochs are exponentially distributed with the same
parameter. State 0 is an absorbing state. Consider states in $N$.
If action 0 is taken, the state 0 is entered next with probability
one. If action 1 is taken, the state numbered 1 higher is entered
next with probability one. The cost structure is simple. Each time a
state in $N$ is reached, an immediate cost of 2 units is incurred, and
each time the state 0 is entered, an immediate cost of 1 unit is
incurred. Any policy which chooses action 0 in all the states above
a given number is average optimal. The undiscounted optimal policy is the
one which always chooses action 0, since it reaches the cheaper absorbing
state without delay. This is clearly the desired policy.
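
The cost structure of this example can be checked with a short Python
sketch. It rests on assumptions made here for illustration only: costs
are tallied per decision epoch (all holding times have the same mean, so
the time scale cancels when policies are compared), the absorbing state 0
incurs its unit cost at every epoch, and policies are of threshold type,
taking action 1 below a hypothetical `threshold` and action 0 at or above
it. The function names are likewise illustrative.

```python
from typing import List


def run_costs(start: int, threshold: float, epochs: int) -> List[int]:
    """Immediate costs at successive decision epochs under a threshold policy."""
    costs, s = [], start
    for _ in range(epochs):
        costs.append(1 if s == 0 else 2)  # 1 unit in state 0, 2 units in N
        if s == 0:
            pass                          # state 0 is absorbing
        elif s >= threshold:
            s = 0                         # action 0: jump to state 0
        else:
            s += 1                        # action 1: move one state higher
    return costs


def average_cost(costs: List[int]) -> float:
    """Average cost per decision epoch over the simulated horizon."""
    return sum(costs) / len(costs)


def excess_cost(costs: List[int], average: int = 1) -> int:
    """Total cost in excess of the long-run average of 1 unit per epoch."""
    return sum(c - average for c in costs)
```

Under these assumptions, every finite threshold gives an average cost
per epoch that tends to 1 as the horizon grows, while the excess cost
equals the number of states in $N$ visited before absorption. A policy
that never takes action 0 (for instance `threshold = float("inf")`) has
average cost 2 per epoch.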


One special reason for using the undiscounted cost criterion is
as follows. Under certain circumstances there may exist a sequence of
average optimal policies $\pi_1, \pi_2, \ldots$ such that using $\pi_1$
for the first decision, $\pi_2$ for the second, $\pi_3$ for the third,
and so on, leads to a long-run expected average cost which is higher than
the optimal one. This can easily be seen from the example above. First
let $\pi_n$ be the policy which chooses action 1 for states numbered less
than $n$ and action 0 for states numbered $n$ or higher. Each $\pi_n$ is
average optimal. But using $\pi_n$ at the $n$th decision epoch for
$n = 1, 2, \ldots$, leads to a long-run expected average cost twice as
high as the optimal one. Notice that since there is a unique undiscounted
optimal policy, this situation cannot occur when the undiscounted cost
criterion is used. In general, there is no guarantee for the existence of
a unique undiscounted optimal policy, but often a unique undiscounted
optimal policy does exist and thus the undesirable situation mentioned
above can be avoided by using the more refined criterion. Some useful
semi-Markov process terminology will now be introduced.

A state is called transient if with probability one it will not
be reentered after some time. A state is called recurrent if with
probability one it will always be reentered. A recurrent state is
positive recurrent if the expected time between consecutive visits of
this state is finite. Otherwise, it is called negative recurrent. If
there is a positive probability that a state is reached in a finite time
from another state and vice versa, then the two states are said to
communicate. The positive recurrent states belong to one or more positive
recurrent classes of states. Each positive recurrent class is a set
of positive recurrent states which communicate with each other, but not
with states outside the class. We make the following assumptions.

Assumption 1: There is an $\varepsilon > 0$ such that

$$q(s,a,s',\varepsilon) = 0, \quad \text{for } s \in S,\ a \in A_s,\ s' \in S.$$

In words, the time between two consecutive decision epochs is at
least $\varepsilon$.

Assumption 2: For each $\pi \in \mathcal{D}$ and $s \in S$, the expected
cost incurred and the expected time elapsed from time $t$ until the first
decision epoch after (or at) time $t$, divided by the time $t$, have
zero as their limits as $t$ tends to infinity, given the start-state
$s$ and policy $\pi$.

Faced with a particular semi-Markov decision process, one may have
difficulties in showing that it satisfies the above assumptions. However,
we have not been able to do without them. If the semi-Markov decision
process is a Markov decision process, then the second assumption is
trivially satisfied.

Some convenient notation will now be introduced. For each
$\pi \in \mathcal{D}$, let $q_\pi$ and $\tau_\pi$ be the mappings from
$S \times S$ into $R$ such that

$$q_\pi(s,s') = \lim_{t \to \infty} q(s, a_\pi(s), s', t),$$

$$\tau_\pi(s,s') = \int_{t \in R_+} t \, dq(s, a_\pi(s), s', t),$$

for $s, s' \in S$. $a_\pi(s)$ is the action chosen by $\pi$ in the state
$s$. For each $\pi \in \mathcal{D}$, also let $\tau_\pi$ and $c_\pi$ be
the mappings from $S$ into $R$ such that

$$\tau_\pi(s) = \sum_{s' \in S} \tau_\pi(s,s'),$$

$$c_\pi(s) = \lim_{t \to \infty} c(s, a_\pi(s), t),$$

for $s \in S$. $q_\pi(s,s')$ is the probability that the next state will
be $s'$, given the present state $s$ and policy $\pi$. $\tau_\pi(s,s')$
is $q_\pi(s,s')$ multiplied by the expected time until the next decision
epoch, given that the next state is $s'$. $\tau_\pi(s)$ is the expected
time until the next decision epoch, given the present state $s$ and
policy $\pi$. $c_\pi(s)$ is the expected cost until the next decision
epoch, given the present state $s$ and policy $\pi$. Naturally, we assume
that all these quantities exist and are finite.

If the state space is finite, it can easily be shown that
$\varphi_\pi$ and $w_\pi$ satisfy the following equations,

$$\varphi_\pi(s) = \sum_{s' \in S} q_\pi(s,s')\,\varphi_\pi(s'),$$

$$w_\pi(s) = c_\pi(s) - \sum_{s' \in S} \tau_\pi(s,s')\,\varphi_\pi(s') + \sum_{s' \in S} q_\pi(s,s')\,w_\pi(s'),$$

for $s \in S$ and $\pi \in \mathcal{D}$ (see Denardo and Fox (1968)).
The expressions on the right-hand side are obtained by conditioning on
the time of the second decision epoch and the state at that epoch. If
$\pi, \pi' \in \mathcal{D}$ and $\pi'' \in P$ are such that $\pi''$ uses
$\pi'$ at the first decision epoch and $\pi$ thereafter, then

$$\varphi_{\pi''}(s) = \sum_{s' \in S} q_{\pi'}(s,s')\,\varphi_\pi(s'),$$

$$w_{\pi''}(s) = c_{\pi'}(s) - \sum_{s' \in S} \tau_{\pi'}(s,s')\,\varphi_\pi(s') + \sum_{s' \in S} q_{\pi'}(s,s')\,w_\pi(s'),$$

for $s \in S$. If $\varphi_{\pi''}(s) \le \varphi_\pi(s)$ and
$w_{\pi''}(s) \le w_\pi(s)$ for $s \in S$, and if, in addition,
$\varphi_{\pi''}(s) < \varphi_\pi(s)$ or $w_{\pi''}(s) < w_\pi(s)$ for
some $s \in S$, then $\pi''$ is an improvement over $\pi$. It can be
shown that $\pi'$ is also an improvement over $\pi$ in that case (see
Denardo and Fox (1968)). This motivates the following definitions.

A policy $\pi$ is called unimprovable if

$$\varphi_\pi(s) \le \sum_{s' \in S} q_{\pi'}(s,s')\,\varphi_\pi(s'),$$

$$w_\pi(s) \le c_{\pi'}(s) - \sum_{s' \in S} \tau_{\pi'}(s,s')\,\varphi_\pi(s') + \sum_{s' \in S} q_{\pi'}(s,s')\,w_\pi(s'),$$

for $s \in S$ and $\pi' \in \mathcal{D}$, assuming that all of the
expressions above are well-defined and finite. A policy $\pi$ is strictly
unimprovable if it is unimprovable and if, in addition, equalities in the
above expressions are achieved simultaneously only when $\pi' = \pi$.


If the state space is finite, then an unimprovable policy is average
optimal (see Denardo and Fox (1968)). If the state space is not finite,
an unimprovable policy is not necessarily average optimal any more (see
Hordijk (1974)). Thus, some additional conditions must be satisfied in
order to guarantee that an unimprovable policy is optimal. Such
conditions are given in the next sections.
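
For a finite state space, the evaluation equations and the unimprovability
test of this section can be sketched in Python as follows. The sketch
assumes a single positive recurrent class, so that the average cost is a
constant `phi` and the first inequality holds with equality. In that case
the equations determine $w$ only up to an additive constant, so the code
fixes `w[0] = 0` and returns relative values rather than the limit that
defines $w_\pi$. The data layout, the function names and the two-state
numbers are hypothetical.

```python
from typing import Dict, List, Tuple

# (expected cost until next epoch, expected time until next epoch,
#  next-state probabilities) for each (state, action) pair
Row = Tuple[float, float, List[float]]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def evaluate(data: Dict[Tuple[int, str], Row], policy: List[str]):
    """Solve w(s) = c(s) - tau(s) * phi + sum_j q(s, j) w(j) with w(0) = 0."""
    n = len(policy)
    a, b = [], []
    for s in range(n):
        cost, tau, probs = data[(s, policy[s])]
        # unknowns are ordered as (phi, w(1), ..., w(n-1))
        a.append([tau] + [(1.0 if j == s else 0.0) - probs[j]
                          for j in range(1, n)])
        b.append(cost)
    x = solve(a, b)
    return x[0], [0.0] + x[1:]


def is_unimprovable(data, policy, actions, tol: float = 1e-9) -> bool:
    """Check the second inequality for every state and alternative action."""
    phi, w = evaluate(data, policy)
    for s in range(len(policy)):
        for act in actions[s]:
            cost, tau, probs = data[(s, act)]
            rhs = cost - tau * phi + sum(p * wj for p, wj in zip(probs, w))
            if w[s] > rhs + tol:
                return False
    return True


DATA = {(0, "a"): (2.0, 1.0, [0.5, 0.5]), (0, "b"): (3.0, 1.0, [0.5, 0.5]),
        (1, "a"): (1.0, 2.0, [1.0, 0.0]), (1, "b"): (4.0, 2.0, [1.0, 0.0])}
ACTIONS = {0: ["a", "b"], 1: ["a", "b"]}
phi, w = evaluate(DATA, ["a", "a"])
```

With these numbers the policy choosing `"a"` in both states has average
cost 1.25 and passes the test, while the policy choosing `"b"` in state 0
fails it.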

2.  The  Case  of  Constant  Optimal  Expected  Average  Cost. 

For many semi-Markov decision processes, the optimal long-run expected
average cost is constant (i.e., independent of the start-state). In
particular, if any state can be reached from each state (by using an
appropriate policy) such that the expected cost up to the time the state
is reached is well defined and finite, then the optimal long-run expected
average cost must be constant. For in this case, the long-run expected
average cost, given any start-state $s$ and policy $\pi$, can be obtained
for any other start-state by using a policy whose actions coincide with
those of $\pi$ at states which are reached from $s$ with a non-zero
probability under $\pi$, and otherwise are such that the expected cost up
to the time when $s$ is reached is finite.

For each $\pi \in \mathcal{D}$, let $x_\pi$ be the mapping from $S \times S$ into $R_+$
such that

\[ x_\pi(s,s') = \lim_{t \to \infty} E_{\pi,s}\Big\{ \sum_{n \in N_t} \delta(s_n,s') \Big\} / t , \]

for $s,s' \in S$. Here, $\delta$ is the Kronecker delta function, given by

\[ \delta(s,s') = \begin{cases} 1 & \text{if } s = s' , \\ 0 & \text{if } s \ne s' . \end{cases} \]

The fact that $x_\pi$ exists (although possibly infinite valued) follows
from renewal theory (see Smith (1955)). We assume that the expected
time until the second decision epoch, given any start-state and action
at the first decision epoch, is non-zero. This implies that $x_\pi$ is
always finite valued.


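Lemma 1: For each $\pi \in \mathcal{D}$,

\[ x_\pi(s,s') = \sum_{s'' \in S} x_\pi(s,s'') \cdot q_\pi(s'',s') , \]

for $s,s' \in S$.

Proof: For each $\alpha > 0$, $\pi \in \mathcal{D}$, let $x_{\pi,\alpha}$ be the mapping from $S \times S$
into $R_+$ such that

\[ x_{\pi,\alpha}(s,s') = E_{\pi,s}\Big\{ \sum_{n \in N} e^{-\alpha t_n} \cdot \delta(s_n,s') \Big\} , \]

for $s,s' \in S$. Since $x_\pi$ exists,

\[ x_\pi(s,s') = \lim_{\alpha \to 0} \alpha \, x_{\pi,\alpha}(s,s') , \]

for $s,s' \in S$. Now

\[ x_{\pi,\alpha}(s,s') = \sum_{s'' \in S} x_{\pi,\alpha}(s,s'') \cdot q_{\pi,\alpha}(s'',s') + \delta(s,s') , \]

for $s,s' \in S$, where

\[ q_{\pi,\alpha}(s,s') = \int_0^\infty e^{-\alpha t} \, dq(s,a_\pi(s),s',t) . \]

This implies that

\[ \lim_{\alpha \to 0} \alpha \, x_{\pi,\alpha}(s,s') = \lim_{\alpha \to 0} \sum_{s'' \in S} \alpha \, x_{\pi,\alpha}(s,s'') \cdot q_{\pi,\alpha}(s'',s') , \]

or

\[ x_\pi(s,s') = \sum_{s'' \in S} x_\pi(s,s'') \cdot q_\pi(s'',s') , \quad \text{for } s,s' \in S . \]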

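Lemma 2: Let $\varepsilon$ ($> 0$) be as in Assumption 1. Then, for each $\pi \in \mathcal{D}$,

\[ E_{\pi,s}\Big\{ \sum_{n \in N_t} \delta(s',s_n) \Big\} \le \frac{t}{\varepsilon} \cdot \frac{x_\pi(s,s')}{x_\pi(s,s)} , \quad \text{for } t \in R_+ , \]

for states $s$ and $s'$ which are positive recurrent under $\pi$.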

Proof: Let $\pi$ be a policy in $\mathcal{D}$, and let $s$ and $s'$ be positive
recurrent states under $\pi$. By Lemma 1,

\[ x_\pi(s,s') = \sum_{s'' \in S} x_\pi(s,s'') \cdot E_{\pi,s''}\{ \delta(s',s_2) \} . \]

Using Lemma 1 repeatedly, we obtain

\[ x_\pi(s,s') = \sum_{s'' \in S} x_\pi(s,s'') \cdot E_{\pi,s''}\{ \delta(s',s_n) \} , \quad \text{for } n \in N . \]

Therefore

\[ E_{\pi,s}\{ \delta(s',s_n) \} \le \frac{x_\pi(s,s')}{x_\pi(s,s)} , \quad \text{for } n \in N . \]


Now 


E ( 2 8(s*,s  )}  = L.  CsCsSs  )*P_  Ct  < t |s  )} 

K * n/J  n"N  7T,S  v n 7T,s  n-  nJJ 


neN. 


n < — 

— e 


by  Assumption  1.  The  lemma  now  follows  by  combining  the  two  last  results. 


Lemma 3: If $\pi^*$ is an unimprovable policy such that $\varphi_{\pi^*}(s)$ is constant
and

\[ \sum_{s' \in S} x_\pi(s,s') \cdot |w_{\pi^*}(s')| < \infty , \]

for $s \in S$ and $\pi \in \mathcal{D}$, then

\[ \varphi_{\pi^*}(s) \cdot \sum_{s' \in S} x_\pi(s,s') \cdot \nu_\pi(s') \le \sum_{s' \in S} x_\pi(s,s') \cdot c_\pi(s') , \]

for $s \in S$ and $\pi \in \mathcal{D}$.

Proof: Let $\varphi$ be the constant such that $\varphi = \varphi_{\pi^*}(s)$, for $s \in S$.
Since $\pi^*$ is unimprovable,

\[ c_\pi(s') \ge w_{\pi^*}(s') + \varphi \cdot \nu_\pi(s') - \sum_{s'' \in S} q_\pi(s',s'') \cdot w_{\pi^*}(s'') , \]

for $s' \in S$ and $\pi \in \mathcal{D}$. Multiplying both sides by $x_\pi(s,s')$ and
summing over $s' \in S$ yields

\[ \sum_{s' \in S} x_\pi(s,s') \cdot c_\pi(s') \ge \sum_{s' \in S} x_\pi(s,s') \Big\{ w_{\pi^*}(s') + \varphi \cdot \nu_\pi(s') - \sum_{s'' \in S} q_\pi(s',s'') \, w_{\pi^*}(s'') \Big\} , \]

for $s \in S$, $\pi \in \mathcal{D}$. The sums on both sides of the above inequality
exist, since

\[ \sum_{s' \in S} x_\pi(s,s') \Big\{ w_{\pi^*}(s') + \varphi \cdot \nu_\pi(s') - \sum_{s'' \in S} q_\pi(s',s'') \, w_{\pi^*}(s'') \Big\}^- \]
\[ \le \sum_{s' \in S} x_\pi(s,s') \, w_{\pi^*}(s')^- + \sum_{s',s'' \in S} x_\pi(s,s') \, q_\pi(s',s'') \, w_{\pi^*}(s'')^+ + \varphi^- \cdot \sum_{s' \in S} x_\pi(s,s') \cdot \nu_\pi(s') \]

In words, $y_\pi(s,s')$ is the expected number of times the state of the
system is $s'$ before a positive recurrent state is entered from another
state, given that the start-state is $s$ and that the policy $\pi$ is used.


Theorem 4: If $\pi^*$ is an unimprovable policy such that $\varphi_{\pi^*}(s)$ is
constant and

\[ \sum_{s' \in S} \big( y_\pi(s,s') + x_\pi(s,s') \big) \cdot |w_{\pi^*}(s')| < \infty , \]

for $s \in S$ and $\pi \in \mathcal{D}$, then $\pi^*$ is average optimal.


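Proof: We first show that

\[ \varphi_\pi(s) \ge \sum_{s' \in S} x_\pi(s,s') \cdot c_\pi(s') , \]

for $s \in S$ and $\pi \in \mathcal{D}$.

For each $\pi \in \mathcal{D}$, let $\bar{q}_\pi$ and $\bar{c}_\pi$ be the mappings from $S \times S$
and $S$ into $R$ such that

\[ \bar{q}_\pi(s,s') = \begin{cases} q_\pi(s,s') , & \text{for } s \in T(\pi) , \\ 0 , & \text{for } s \in R(\pi) , \end{cases} \]

\[ \bar{c}_\pi(s) = \begin{cases} c_\pi(s) - \varphi \cdot \nu_\pi(s) , & \text{for } s \in T(\pi) , \\ w_{\pi^*}(s) , & \text{for } s \in R(\pi) , \end{cases} \]

for $s' \in S$. Since $\pi^*$ is unimprovable,

\[ \bar{c}_\pi(s) \ge w_{\pi^*}(s) - \sum_{s' \in S} \bar{q}_\pi(s,s') \cdot w_{\pi^*}(s') , \]

for $s \in S$ and $\pi \in \mathcal{D}$. Now

\[ \sum_{s \in S} y_\pi(s'',s) \Big\{ w_{\pi^*}(s) - \sum_{s' \in S} \bar{q}_\pi(s,s') \cdot w_{\pi^*}(s') \Big\}^- \]
\[ \le \sum_{s \in S} y_\pi(s'',s) \cdot w_{\pi^*}(s)^- + \sum_{s \in S} y_\pi(s'',s) \sum_{s' \in S} \bar{q}_\pi(s,s') \, w_{\pi^*}(s')^+ \]
\[ = \sum_{s \in S} y_\pi(s'',s) \, w_{\pi^*}(s)^- + \sum_{s' \in S} \big( y_\pi(s'',s') - \delta(s'',s') \big) \cdot w_{\pi^*}(s')^+ \]
\[ = \sum_{s \in S} y_\pi(s'',s) \, w_{\pi^*}(s)^- + \sum_{s' \in S} y_\pi(s'',s') \, w_{\pi^*}(s')^+ - w_{\pi^*}(s'')^+ \]
\[ < \infty , \]

by the last assumption of the theorem. This implies that

\[ \sum_{s \in S} y_\pi(s'',s) \cdot \bar{c}_\pi(s)^- < \infty , \quad s'' \in S , \ \pi \in \mathcal{D} . \]

Thus

\[ \sum_{s \in S} y_\pi(s'',s) \cdot \bar{c}_\pi(s) \]

is well-defined and greater than minus infinity for $s'' \in S$ and $\pi \in \mathcal{D}$.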
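Now

\[ \begin{aligned}
\varphi_\pi(s) &= \lim_{t \to \infty} E_{\pi,s}\Big\{ \sum_{n \in N_t} c(s_n,a_n,t-t_n) \Big\} / t \\
&= \lim_{t \to \infty} E_{\pi,s}\Big\{ \sum_{n \in N_t} c(s_n,a_n,\infty) \Big\} / t \qquad \text{(by Assumption 2)} \\
&= \lim_{t \to \infty} E_{\pi,s}\Big\{ \sum_{n \in N_t} \sum_{s' \in S} \delta(s',s_n) \cdot c_\pi(s') \Big\} / t \\
&= \lim_{t \to \infty} E_{\pi,s}\Big\{ \sum_{n \in N_t} \sum_{s' \in T(\pi)} \delta(s',s_n) \cdot c_\pi(s') \Big\} / t
 + \lim_{t \to \infty} E_{\pi,s}\Big\{ \sum_{n \in N_t} \sum_{s' \in R(\pi)} \delta(s',s_n) \cdot c_\pi(s') \Big\} / t \\
&= \varphi + \lim_{t \to \infty} \sum_{s' \in T(\pi)} E_{\pi,s}\Big\{ \sum_{n \in N_t} \delta(s',s_n) \Big\} \cdot \bar{c}_\pi(s') / t
 + \lim_{t \to \infty} \sum_{s' \in R(\pi)} E_{\pi,s}\Big\{ \sum_{n \in N_t} \delta(s',s_n) \Big\} \cdot \big( c_\pi(s') - \varphi \cdot \nu_\pi(s') \big) / t ,
\end{aligned} \]

using Assumption 2. The first limit is non-negative, since

\[ E_{\pi,s}\Big\{ \sum_{n \in N} \delta(s',s_n) \Big\} \le y_\pi(s,s') , \quad \text{for } s' \in S , \]

and since

\[ \sum_{s' \in S} y_\pi(s,s') \cdot \bar{c}_\pi(s')^- < \infty . \]

Therefore

\[ \varphi_\pi(s) \ge \varphi + \lim_{t \to \infty} \sum_{s' \in R(\pi)} E_{\pi,s}\Big\{ \sum_{n \in N_t} \delta(s',s_n) \Big\} \big( c_\pi(s') - \varphi \cdot \nu_\pi(s') \big) / t . \]

Using Lebesgue's bounded convergence theorem, we obtain

\[ \lim_{t \to \infty} \sum_{s' \in R(\pi)} E_{\pi,s}\Big\{ \sum_{n \in N_t} \delta(s',s_n) \Big\} \cdot \big( c_\pi(s') - \varphi \cdot \nu_\pi(s') \big) / t
 = \sum_{s' \in R(\pi)} x_\pi(s,s') \cdot \big( c_\pi(s') - \varphi \cdot \nu_\pi(s') \big) , \]

since

\[ \lim_{t \to \infty} E_{\pi,s}\Big\{ \sum_{n \in N_t} \delta(s',s_n) \Big\} / t = x_\pi(s,s') , \quad \text{for } s' \in S , \]

\[ E_{\pi,s''}\Big\{ \sum_{n \in N_t} \delta(s',s_n) \Big\} / t \le \frac{1}{\varepsilon} \cdot \frac{x_\pi(s'',s')}{x_\pi(s'',s'')} , \quad \text{for } s' \in S , \ s'' \in R(\pi) , \]

\[ \sum_{s' \in S} x_\pi(s'',s') \big( c_\pi(s') - \varphi \cdot \nu_\pi(s') \big)^- < \infty . \]

Thus

\[ \varphi_\pi(s) \ge \varphi + \sum_{s' \in S} x_\pi(s,s') \big( c_\pi(s') - \varphi \cdot \nu_\pi(s') \big) . \]

Using Lemma 3, we obtain

\[ \varphi_\pi(s) \ge \varphi . \]

Q.E.D.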


Corollary 5: Suppose that, for each $s \in S$ and $\pi \in \mathcal{D}$, the expected
number of decision epochs occurring before reaching a state in $R(\pi)$
is finite. Then, if $\pi^*$ is an unimprovable policy such that $\varphi_{\pi^*}(s)$
is constant and, in addition, $w_{\pi^*}(s)$ is bounded, then $\pi^*$ is average
optimal.

Proof: In view of the theorem and the fact that $w_{\pi^*}(s)$ is bounded,
we only need to show that

\[ \sum_{s' \in S} y_\pi(s,s') < \infty , \]

for $s \in S$ and $\pi \in \mathcal{D}$. But this follows from the first assumption of
the corollary, which completes the proof.


Theorem 6: If $\pi^*$ is a strictly unimprovable policy such that $\varphi_{\pi^*}(s)$
is constant and, in addition,

\[ \sum_{s' \in S} \big( y_\pi(s,s') + x_\pi(s,s') \big) \cdot |w_{\pi^*}(s')| < \infty , \]

for $s \in S$ and $\pi \in \mathcal{D}$, then $\pi^*$ is undiscounted optimal.

Proof: Let $\pi$ be any average optimal stationary, deterministic policy.
Following the proof of Lemma 3 and Theorem 4, one can easily see that
$a_\pi(s) \ne a_{\pi^*}(s)$ implies that $\varphi_\pi(s) > \varphi_{\pi^*}(s)$ for $s \in R(\pi)$, since $\pi^*$
is strictly unimprovable. This implies that $a_\pi(s) = a_{\pi^*}(s)$ for $s \in R(\pi)$.
From the proof of Theorem 4,

\[ \bar{c}_\pi(s) \ge w_{\pi^*}(s) - \sum_{s' \in S} \bar{q}_\pi(s,s') \, w_{\pi^*}(s') , \]

for $s \in S$. This implies that

\[ \sum_{s \in S} y_\pi(s'',s) \, \bar{c}_\pi(s) \ge \sum_{s \in S} y_\pi(s'',s) \Big\{ w_{\pi^*}(s) - \sum_{s' \in S} \bar{q}_\pi(s,s') \, w_{\pi^*}(s') \Big\} . \]

It was shown in the proof of Theorem 4 that these sums are well-defined.
Now

\[ \begin{aligned}
\sum_{s \in S} y_\pi(s'',s) \Big\{ w_{\pi^*}(s) - \sum_{s' \in S} \bar{q}_\pi(s,s') \, w_{\pi^*}(s') \Big\}
&= \sum_{s \in S} y_\pi(s'',s) \, w_{\pi^*}(s) - \sum_{s,s' \in S} y_\pi(s'',s) \, \bar{q}_\pi(s,s') \, w_{\pi^*}(s') \\
&= \sum_{s \in S} y_\pi(s'',s) \, w_{\pi^*}(s) - \sum_{s' \in S} \big( y_\pi(s'',s') - \delta(s'',s') \big) \, w_{\pi^*}(s') \\
&= w_{\pi^*}(s'') ,
\end{aligned} \]

for $s'' \in S$. Hence

\[ w_{\pi^*}(s'') \le \sum_{s \in S} y_\pi(s'',s) \cdot \bar{c}_\pi(s) , \]

for $s'' \in S$ and $\pi \in \mathcal{D}$. It is easy to check that

\[ w_\pi(s'') = \sum_{s \in S} y_\pi(s'',s) \cdot \bar{c}_\pi(s) , \]

for $s'' \in S$, so

\[ w_{\pi^*}(s'') \le w_\pi(s'') , \]

for $s'' \in S$.

Q.E.D.


Corollary 7: Suppose that for each $s \in S$ and $\pi \in \mathcal{D}$ the expected
number of decision epochs occurring before reaching a state in $R(\pi)$
is finite. Then, if $\pi^*$ is a strictly unimprovable policy such that
$\varphi_{\pi^*}(s)$ is constant and, in addition, $w_{\pi^*}(s)$ is bounded, then $\pi^*$
is undiscounted optimal.

Proof: The proof proceeds just as in the proof of Corollary 5 and so
will not be repeated here.

3.  The  Case  of  Non-Constant  Optimal  Expected  Average  Cost. 

The  case  when  the  optimal  long-run  expected  average  cost  varies  with 
the  start-state  now  will  be  considered.  The  notation  is  the  same  as  in 
Section  2. 

Lemma 8: If $\pi^*$ is a policy such that

for $s \in S$ and $\pi \in \mathcal{D}$, then $\varphi_{\pi^*}(s)$ is constant in each positive
recurrent class of states under each policy $\pi \in \mathcal{D}$.

Proof: Let $\pi$ be a policy in $\mathcal{D}$ and let $s$ be a state in $R(\pi)$.
Using Lemma 1 repeatedly, we obtain

\[ x_\pi(s,s'') = \sum_{s' \in S} x_\pi(s,s') \cdot E_{\pi,s'}\{ \delta(s'',s_n) \} , \]

for $n \in N$ and $s'' \in S$. This implies that

\[ E_{\pi,s}\{ \delta(s'',s_n) \} \le \frac{x_\pi(s,s'')}{x_\pi(s,s)} , \]

for $n \in N$ and $s'' \in S$, since $x_\pi(s,s) > 0$. Now

\[ \sum_{s'' \in S} \frac{x_\pi(s,s'')}{x_\pi(s,s)} \cdot |\varphi_{\pi^*}(s'')| < \infty , \]

because of the second assumption of the lemma. Using Lebesgue's bounded
convergence theorem, we obtain

\[ \lim_{n \to \infty} \sum_{s'' \in S} E_{\pi,s}\{ \delta(s'',s_n) \} \cdot \varphi_{\pi^*}(s'') = \sum_{s'' \in S} x_\pi(s,s'') \cdot \varphi_{\pi^*}(s'') , \]

or equivalently,

\[ \lim_{n \to \infty} E_{\pi,s}\{ \varphi_{\pi^*}(s_n) \} = \sum_{s'' \in S} x_\pi(s,s'') \cdot \varphi_{\pi^*}(s'') . \]

Let $d_\pi$ be the mapping from $S$ into $R$ such that

\[ d_\pi(s'') = \sum_{s' \in S} q_\pi(s'',s') \cdot \varphi_{\pi^*}(s') - \varphi_{\pi^*}(s'') , \]

for $s'' \in S$. $d_\pi$ is well-defined by the first assumption of the lemma.
It can easily be shown by induction on $n$ that

\[ E_{\pi,s''}\Big\{ \sum_{i < n} d_\pi(s_i) \Big\} = E_{\pi,s''}\{ \varphi_{\pi^*}(s_n) \} - \varphi_{\pi^*}(s'') , \]

for $n \in N$ and $s'' \in S$. Inserting $s$ for $s''$ in this expression
and taking the limits as $n$ goes to infinity, we obtain

\[ \begin{aligned}
\lim_{n \to \infty} E_{\pi,s}\Big\{ \sum_{i < n} d_\pi(s_i) \Big\} &= \lim_{n \to \infty} E_{\pi,s}\{ \varphi_{\pi^*}(s_n) \} - \varphi_{\pi^*}(s) \\
&= \sum_{s' \in S} x_\pi(s,s') \cdot \varphi_{\pi^*}(s') - \varphi_{\pi^*}(s) < \infty .
\end{aligned} \]

The first assumption of the lemma implies that $d_\pi(s') \ge 0$, for $s' \in S$
and $\pi \in \mathcal{D}$. Using this fact together with

\[ \lim_{n \to \infty} E_{\pi,s}\Big\{ \sum_{i < n} d_\pi(s_i) \Big\} < \infty , \]

we obtain $d_\pi(s) = 0$. But $s \in R(\pi)$ was chosen arbitrarily, so
$d_\pi(s) = 0$ for $s \in R(\pi)$. This implies that

\[ \varphi_{\pi^*}(s) = \sum_{s' \in R(\pi)} q_\pi(s,s') \cdot \varphi_{\pi^*}(s') , \]

for $s \in R(\pi)$. This, in turn, implies that

\[ \varphi_{\pi^*}(s) = \lim_{n \to \infty} E_{\pi,s}\{ \varphi_{\pi^*}(s_n) \} = \sum_{s' \in S} x_\pi(s,s') \cdot \varphi_{\pi^*}(s') , \]

for $s \in R(\pi)$. Now, $x_\pi(s,s') = x_\pi(s'',s')$, for $s$ and $s''$, if they
belong to the same positive recurrent class under $\pi$. Thus,

\[ \varphi_{\pi^*}(s) = \sum_{s' \in S} x_\pi(s,s') \cdot \varphi_{\pi^*}(s') = \varphi_{\pi^*}(s'') , \]

for $s,s''$ in the same positive recurrent class under $\pi$.

Q.E.D.


For each $\pi \in \mathcal{D}$, let $I(\pi)$ be the set of positive recurrent
classes, and for each $s \in S$ and $z \in I(\pi)$, let $p_\pi(s,z)$ be the
probability that class $z$ is entered, given start-state $s$ and policy $\pi$.

Lemma 9: If $\pi^*$ is an unimprovable policy such that the conditions of
the previous lemma hold, and, in addition,

\[ \liminf_{n \to \infty} \sum_{s' \in T(\pi)} E_{\pi,s}\{ \delta(s',s_n) \} \cdot \varphi_{\pi^*}(s') \le 0 , \]

for $s \in S$ and $\pi \in \mathcal{D}$, then

\[ \varphi_{\pi^*}(s) \le \sum_{z \in I(\pi)} p_\pi(s,z) \cdot \varphi_z , \]

for $s \in S$ and $\pi \in \mathcal{D}$. Here, $\varphi_z$ is the long-run expected average
cost under $\pi^*$, given that the start-state is in the class $z$.

Proof: Let $\pi$ be any policy in $\mathcal{D}$, and let $S_z$ be the set of states
belonging to class $z$ for each $z \in I(\pi)$. As in the proof of Lemma 8,

\[ \begin{aligned}
\varphi_{\pi^*}(s) &\le \liminf_{n \to \infty} E_{\pi,s}\{ \varphi_{\pi^*}(s_n) \} \\
&\le \lim_{n \to \infty} \sum_{s' \in R(\pi)} E_{\pi,s}\{ \delta(s',s_n) \} \cdot \varphi_{\pi^*}(s')
 + \liminf_{n \to \infty} \sum_{s' \in T(\pi)} E_{\pi,s}\{ \delta(s',s_n) \} \cdot \varphi_{\pi^*}(s') \\
&\le \lim_{n \to \infty} \sum_{s' \in R(\pi)} E_{\pi,s}\{ \delta(s',s_n) \} \cdot \varphi_{\pi^*}(s') ,
\end{aligned} \]

for $s \in S$. The last limit exists and is finite. By Lemma 2,

X ( s 11  ^ s”* ) 

Vs(8(s,,sn)1  - VS'z)  < J ~7pr-p- rr  for  some  s"  e R(tt),  e > 0 , 

rr  j 

for  s'  e S . s e S and  z 6 l(7r).  Now 


zel(Tr) 


for $s \in S$. Therefore, by Lebesgue's bounded convergence theorem,

\[ \lim_{n \to \infty} \sum_{s' \in R(\pi)} E_{\pi,s}\{ \delta(s',s_n) \} \cdot \varphi_{\pi^*}(s') = \sum_{z \in I(\pi)} p_\pi(s,z) \cdot \varphi_z , \]

for $s \in S$. We conclude that

\[ \varphi_{\pi^*}(s) \le \sum_{z \in I(\pi)} p_\pi(s,z) \cdot \varphi_z , \]

for $s \in S$.

Q.E.D.

Let $y_\pi$ be defined as in Section 2.


Theorem 10: If $\pi^*$ is an unimprovable policy such that

\[ \sum_{s' \in S} x_\pi(s,s') \cdot |\varphi_{\pi^*}(s')| < \infty , \]

\[ \sum_{s' \in S} \big( x_\pi(s,s') + y_\pi(s,s') \big) \cdot |w_{\pi^*}(s')| < \infty , \]

\[ \liminf_{n \to \infty} \sum_{s' \in T(\pi)} E_{\pi,s}\{ \delta(s',s_n) \} \cdot \varphi_{\pi^*}(s') \le 0 , \]

for $s \in S$ and $\pi \in \mathcal{D}$, then $\pi^*$ is average optimal.

Proof: Let $\pi$ be any policy in $\mathcal{D}$. By Lemma 8, $\varphi_{\pi^*}(s)$ is constant
for $s \in S_z$ ($z \in I(\pi)$). Therefore, by Theorem 4, $\varphi_{\pi^*}(s) \le \varphi_\pi(s)$, for
$s \in S_z$ ($z \in I(\pi)$). Using Lemma 9 together with this, we obtain
$\varphi_{\pi^*}(s) \le \varphi_\pi(s)$ for $s \in S$. The costs incurred until a positive recurrent
state is reached do not contribute anything to the average cost, since
it can be shown (as in the proof of Theorem 4) that the expected cost
until a positive recurrent state is reached is finite. Thus, $\pi^*$ must
be average optimal.

Q.E.D.

Corollary 11: Suppose that, for each $s \in S$ and $\pi \in \mathcal{D}$, the expected
number of decision epochs occurring before reaching a state in $R(\pi)$
is finite. If $\pi^*$ is an unimprovable policy such that $\varphi_{\pi^*}(s)$ and $w_{\pi^*}(s)$
are bounded, then $\pi^*$ is average optimal.

Proof: We only need to show that

\[ \sum_{s' \in S} x_\pi(s,s') < \infty , \]

\[ \sum_{s' \in S} y_\pi(s,s') < \infty , \]

for each $s \in S$. The first sum is finite by an assumption made in Section
1, the second sum is finite by Lemma 2, and the third sum is finite by
the first assumption of the corollary. Thus, the corollary follows.

Q.E.D.

Theorem 12: If $\pi^*$ is a strictly unimprovable policy such that the
conditions of Theorem 10 are satisfied, then $\pi^*$ is undiscounted optimal.

Proof: The proof proceeds just as in the proof of Theorem 6, and so will
not be repeated here.

Corollary 13: If $\pi^*$ is a strictly unimprovable policy such that the
conditions of Corollary 11 are satisfied, then $\pi^*$ is undiscounted
optimal.

Proof: See the proof of Corollary 11.

CHAPTER 5

SEMI-MARKOV DECISION PROCESSES WITH DISCOUNTING

In this chapter the optimization problem arising when the costs are
discounted is investigated. From an economic viewpoint, this problem is
somewhat more interesting than the problem without discounting. It has
been studied by a number of investigators who have made various
assumptions about the state and action spaces, the motion of the system and
the costs (see Section 2 in Chapter 1). Here, the assumptions made by
other authors are weakened, and more general results are obtained.

In Section 1, there is a formal statement of the problem to be
considered. It also contains some preliminary results. In Section 2,
some useful operators are introduced. In Section 3, the optimality
equation is proven. In Section 4, there are some existence theorems.
In Section 5, policy improvement is considered. In Section 6, necessary
and sufficient conditions for optimality are presented. Finally, in
Section 7, there is an analysis using the contraction properties of a
certain operator. An alternative set of necessary and sufficient
conditions for optimality is obtained.

1.  Problem  Formulation. 

As before, let $S$ be the state space, $\{A_s\}_{s \in S}$ be the set of action
spaces, $q$ be the law of motion, and $c$ be the cost function of the
SMDP. For each $n$ in $N$, let $s_n$, $a_n$ and $t_n$ denote the state of
the system, the action and the time of the $n$th decision epoch,
respectively. The first decision epoch is taken to occur at time zero, so
$t_1 = 0$. Also, let $\mathcal{P}$, $\mathcal{S}$ and $\mathcal{D}$ denote the set of all policies, the
set of stationary policies and the set of deterministic stationary policies
respectively. Let $A = \bigcup_{s \in S} A_s$.

Let $\alpha$ be a given positive interest rate, and let $c_\alpha$ be the
mapping from $S \times A$ into $R$ such that

\[ c_\alpha(s,a) = \int_0^\infty e^{-\alpha t} \, dc(s,a,t) , \]

for $a \in A_s$, $s \in S$. In other words, $c_\alpha(s,a)$ is the expected
discounted cost incurred until the second decision epoch, given that
the start-state is $s$ and that the first action is $a$. Naturally, it
is assumed that $c_\alpha$ exists.

For each $\pi$ in $\mathcal{P}$, let $v_\pi^+$, $v_\pi^-$ and $v_\pi$ be the three functions
from $S$ into $R_+ \cup \{\infty\}$, $R_+ \cup \{\infty\}$ and $R \cup \{\infty\}$, respectively, such
that

\[ v_\pi^+(s) = E_{\pi,s}\Big\{ \sum_{n \in N} e^{-\alpha t_n} \cdot c_\alpha(s_n,a_n)^+ \Big\} , \]

\[ v_\pi^-(s) = E_{\pi,s}\Big\{ \sum_{n \in N} e^{-\alpha t_n} \cdot c_\alpha(s_n,a_n)^- \Big\} , \]

\[ v_\pi(s) = v_\pi^+(s) - v_\pi^-(s) , \]

for $s$ in $S$, where $E$ is the expectation operator and the subscripts
$\pi$ and $s$ indicate that the start-state is $s$ and that the policy $\pi$
is used. In words, $v_\pi(s)$ is the total expected discounted cost, given
that the start-state is $s$ and that the policy $\pi$ is used. $v_\pi$ is
the value function of the policy $\pi$. Clearly, $v_\pi^+$ and $v_\pi^-$ are
well-defined (possibly infinite-valued). In order that $v_\pi$ be well-defined,
the following assumption is made:

Assumption 1: $v_\pi^-(s) < \infty$, for $s \in S$, $\pi \in \mathcal{P}$.

Let  v be  the  function  from  S into  R U such  that 

\[ v_\alpha(s) = \inf_{\pi \in \mathcal{P}} v_\pi(s) , \]

for $s$ in $S$. For purposes which will become clear later, the following
assumption is made.

Assumption 2: $v_\alpha(s) > -\infty$, for $s \in S$.

If there can be an infinite number of decision epochs in a finite
amount of time, some of the costs may unintentionally be ignored by
the definition of $v_\pi$. In order to eliminate this problem, the following
assumption is made:

Assumption 3: $P_{\pi,s}\{ t_n \le t \text{ for } n \in N \} = 0$, for $t \in R_+$, $s \in S$, $\pi \in \mathcal{P}$.

Here, $P$ is the probability operator and the subscripts $\pi$ and
$s$ indicate that the start-state is $s$ and that the policy $\pi$ is used.
For purposes that will become clear later, a fourth assumption is made:

Assumption 4: Given $\varepsilon > 0$, there is an $m$ (possibly depending on $s$)
such that


-at 


^!»Le 


for $\pi$ in $\mathcal{P}$.

These assumptions are satisfied trivially if $c_\alpha(s,a)$ is non-negative
for each $s$ and $a$. The following theorem gives some weaker conditions
under which the assumptions hold.

Theorem 1: If $c_\alpha$ is uniformly bounded from below and there is a
$\beta < 1$ such that

\[ E_{\pi,s}\{ e^{-\alpha t_2} \} \le \beta , \]

for $s$ in $S$ and $\pi$ in $\mathcal{P}$, then all the assumptions above hold.

Proof: Let $\beta$ be as in the theorem. For each $n \in N$,

\[ E_{\pi,s}\{ e^{-\alpha t_{n+1}} \} = E_{\pi,s}\big\{ e^{-\alpha t_n} \cdot E_{\pi,s}\{ e^{-\alpha (t_{n+1} - t_n)} \mid s_n, a_n \} \big\} \le \beta \cdot E_{\pi,s}\{ e^{-\alpha t_n} \} , \]

since

\[ E_{\pi,s}\{ e^{-\alpha (t_{n+1} - t_n)} \mid s_n, a_n \} \le \beta . \]

This implies that

\[ E_{\pi,s}\{ e^{-\alpha t_n} \} \le \beta^{n-1} , \]

for $n \in N$. For each $m$ in $N$,

\[ E_{\pi,s}\Big\{ \sum_{n \ge m} e^{-\alpha t_n} \Big\} = \sum_{n \ge m} E_{\pi,s}\{ e^{-\alpha t_n} \} \le \sum_{n \ge m} \beta^{n-1} = \beta^{m-1} \cdot (1-\beta)^{-1} . \]

This implies that

\[ (1-\beta)^{-1} \ge E_{\pi,s}\Big\{ \sum_{n \in N} e^{-\alpha t_n} \Big\} \ge E_{\pi,s}\Big\{ \sum_{n \le m} e^{-\alpha t_n} \Big\} \ge m e^{-\alpha t} \cdot P_{\pi,s}\{ t_n \le t \text{ for } n \le m \} \]

for $t \in R_+$ and $m \in N$. Thus

\[ P_{\pi,s}\{ t_n \le t \text{ for } n \le m \} \le e^{\alpha t} \cdot (1-\beta)^{-1} \cdot m^{-1} \]

for $t \in R_+$ and $m \in N$. Assumption 3 follows by taking the limits as
$m$ goes to infinity.

Let $M$ be an upper bound on $-c_\alpha(s,a)$. Then

\[ E_{\pi,s}\Big\{ \sum_{n \ge m} e^{-\alpha t_n} \cdot c_\alpha(s_n,a_n)^- \Big\} \le M \cdot E_{\pi,s}\Big\{ \sum_{n \ge m} e^{-\alpha t_n} \Big\} \le M \cdot \beta^{m-1} \cdot (1-\beta)^{-1} \]

for $m \in N$. This shows that the rest of the assumptions hold.
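
The bound used for Assumption 3 can be checked numerically in a simple special case. The sketch below assumes i.i.d. exponential times between decision epochs with a hypothetical rate `rate`, so that $\beta = \text{rate}/(\text{rate}+\alpha)$ and the $m$th decision epoch time is Erlang distributed; none of these choices come from the report. It compares the exact probability $P\{t_m \le t\}$ with the bound $e^{\alpha t}(1-\beta)^{-1} m^{-1}$ from the proof.

```python
import math


def discount_factor(rate, alpha):
    """E[exp(-alpha * tau)] for an exponential time tau with the given rate."""
    return rate / (rate + alpha)


def prob_m_epochs_by(t, m, rate):
    """P{t_m <= t} when t_1 = 0 and the times between decision epochs are
    i.i.d. exponential, so that t_m is Erlang with m - 1 stages."""
    stages = m - 1
    if stages == 0:
        return 1.0
    # P{Erlang(stages) > t} = P{Poisson(rate * t) < stages}.
    tail = sum(math.exp(-rate * t) * (rate * t) ** k / math.factorial(k)
               for k in range(stages))
    return 1.0 - tail


def proof_bound(t, m, alpha, beta):
    """Right-hand side exp(alpha * t) / ((1 - beta) * m) from the proof."""
    return math.exp(alpha * t) / ((1.0 - beta) * m)


# Hypothetical parameters chosen only for illustration.
rate, alpha, t = 2.0, 0.1, 5.0
beta = discount_factor(rate, alpha)
bound_holds = all(
    prob_m_epochs_by(t, m, rate) <= proof_bound(t, m, alpha, beta)
    for m in range(1, 101)
)
```

The bound is loose for small $m$ (it exceeds one), but it decreases like $1/m$, which is all that is needed to conclude that infinitely many decision epochs in finite time have probability zero.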


2. The Operators $Q_\pi$ and $T_\pi$.

Let $B$ be the set of mappings from $S$ into $R$. For each
$\pi \in \mathcal{P}$, define the operators $Q_\pi$ and $T_\pi$ from $B$ into $B$ by

\[ (Q_\pi v)(s) = E_{\pi,s}\{ e^{-\alpha t_2} \cdot v(s_2) \} , \quad \text{for } s \in S , \]

\[ (T_\pi v)(s) = E_{\pi,s}\{ c_\alpha(s_1,a_1) + e^{-\alpha t_2} \cdot v(s_2) \} , \quad \text{for } s \in S , \]

for $v$ in $B$. For some $v$ in $B$, the above expressions may not be
well-defined. Those functions will, however, not be used.

Some compact notation will be used. If $u$ and $v$ are functions
in $B$, then $u \le v$ means that $u(s) \le v(s)$ for $s \in S$, $u + v$ is
the function such that $(u+v)(s) = u(s) + v(s)$ for $s \in S$, and if $c$
is a constant, then $cv$ is the function such that $(cv)(s) = c \cdot v(s)$
for $s \in S$, etc.

Lemma 2: If $u$ and $v$ in $B$ are such that $u \le v$, then $Q_\pi u \le Q_\pi v$
and $T_\pi u \le T_\pi v$, provided the expressions are well-defined.

For each $n \in N$ and $\pi \in \mathcal{P}$, define the operators $Q_\pi^n$ and $T_\pi^n$ by

\[ (Q_\pi^n v)(s) = E_{\pi,s}\{ e^{-\alpha t_{n+1}} \cdot v(s_{n+1}) \} , \quad \text{for } s \in S , \]

\[ (T_\pi^n v)(s) = E_{\pi,s}\Big\{ \sum_{i \le n} e^{-\alpha t_i} \cdot c_\alpha(s_i,a_i) + e^{-\alpha t_{n+1}} \cdot v(s_{n+1}) \Big\} , \quad \text{for } s \in S , \]

for $v$ in $B$. Again, these expressions need not always be well-defined
for each $v$ in $B$. If, however, $v$ is the value function of a policy,
then the expressions are clearly well-defined.

Let $\theta$ be the function (from $S$ into $R$) which is zero
everywhere. Then

\[ (T_\pi^n \theta)(s) = E_{\pi,s}\Big\{ \sum_{i \le n} e^{-\alpha t_i} \cdot c_\alpha(s_i,a_i) \Big\} , \quad \text{for } s \in S , \]

\[ v_\pi = \lim_{n \to \infty} T_\pi^n \theta , \]

for any $\pi$ in $\mathcal{P}$.

Lemma 3: For each $n \in N$ and $\pi \in \mathcal{P}$, $Q_\pi^n v_\alpha$ and $T_\pi^n v_\alpha$ are
well-defined.


Proof:  Let  e > 0 be  given,  and  let  7 r*  be  an  e-optimal  policy. 

This  means  that  v > v , + e*H  where  | is  the  function  from  S 
into  (l).  This  implies  that 


< (V  + €‘l)"  < V + €-1  • 


a - ' 7r 


Therefore 


S <<V  + £'t>  S <V  + e'i  ' 


since 


,n  - , 


_n 


This  implies  that  Q^v^  is  finite-valued,  and  thus  is  well- 

defined.  Also 


Tnv""  < Tnv  , + e‘1 ] , 
w a - ir  ir'  “•  * 


since 


<1<1 


This  implies  that  Tjv  , is  finite-valued,  and  thus  Tnv  is  well- 

7T  a 7T  a 


defined. 


Lemma 4: For each $\pi \in \mathcal{P}$,

\[ \liminf_{n \to \infty} Q_\pi^n v_\alpha \ge 0 . \]


Proof:  Let  e > 0 be  given.  From  the  proof  of  Lemma  3,  there  is  a 


policy  7 r'  such  that 


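for all $\pi$ in $\mathcal{P}$. This implies that

\[ \lim_{n \to \infty} Q_\pi^n v_\alpha^- \le \lim_{n \to \infty} Q_\pi^n v_{\pi'}^- + \varepsilon \cdot 1 = \varepsilon \cdot 1 , \]

by Assumption 4. The lemma follows, since $\varepsilon$ is arbitrary.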


3. The Optimality Equation.

Bellman (1957) introduced the principle of optimality for dynamic
programming. He says (p. 83), "An optimal policy has the property that
whatever the initial state and initial decisions are, the remaining
decisions must constitute an optimal policy with regard to the state
resulting from the first decision." Since an optimal policy need not
always exist, the principle has a limited potential use. More useful is
the optimality equation, given in the theorem below. For a discussion
of the principle of optimality and the optimality equation, see Porteus
(1975a).


Let $q_\alpha$ be the mapping from $S \times A \times S$ into $R$ such that

\[ q_\alpha(s,a,s') = \int_0^\infty e^{-\alpha t} \, dq(s,a,s',t) , \]

for $a \in A_s$, for $s', s \in S$.

Theorem 5: For each $s$ in $S$,

\[ v_\alpha(s) = \inf_{a \in A_s} \Big\{ c_\alpha(s,a) + \sum_{s' \in S} q_\alpha(s,a,s') \cdot v_\alpha(s') \Big\} . \]


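Proof: The proof is similar to the one given in Ross (1970, p. 121)
for the case when the action spaces are finite. Let $\pi'$ be an
$\varepsilon$-optimal policy. This exists for each $\varepsilon > 0$, since $v_\alpha(s) > -\infty$ for
each $s \in S$ by Assumption 2. Then

\[ v_\alpha \le T_\pi v_{\pi'} \le T_\pi (v_\alpha + \varepsilon \cdot 1) = T_\pi v_\alpha + Q_\pi (\varepsilon \cdot 1) \le T_\pi v_\alpha + \varepsilon \cdot 1 \]

for all $\pi \in \mathcal{D}$. Since $\varepsilon$ is arbitrary, $v_\alpha \le T_\pi v_\alpha$ for $\pi \in \mathcal{D}$. This
is equivalent to

\[ v_\alpha(s) \le \inf_{a \in A_s} \Big\{ c_\alpha(s,a) + \sum_{s' \in S} q_\alpha(s,a,s') \, v_\alpha(s') \Big\} , \]

for $s \in S$. We now show that this inequality also holds in the opposite
direction.

For each $s \in S$,

\[ \begin{aligned}
v_\pi(s) &= E_{\pi,s}\Big\{ \sum_{n \in N} e^{-\alpha t_n} \cdot c_\alpha(s_n,a_n) \Big\} \\
&= E_{\pi,s}\Big\{ c_\alpha(s_1,a_1) + e^{-\alpha t_2} \cdot E_{\pi,s}\Big\{ \sum_{n > 1} e^{-\alpha (t_n - t_2)} \cdot c_\alpha(s_n,a_n) \,\Big|\, a_1, s_2, t_2 \Big\} \Big\} .
\end{aligned} \]

Now

\[ E_{\pi,s}\Big\{ \sum_{n > 1} e^{-\alpha (t_n - t_2)} \cdot c_\alpha(s_n,a_n) \,\Big|\, a_1, s_2, t_2 \Big\} \ge v_\alpha(s_2) . \]

To see this, suppose the opposite. Then there must be $a'$, $s'$ and $t'$
such that

\[ E_{\pi,s}\Big\{ \sum_{n > 1} e^{-\alpha (t_n - t_2)} \cdot c_\alpha(s_n,a_n) \,\Big|\, a_1 = a', \ s_2 = s', \ t_2 = t' \Big\} < v_\alpha(s') . \]

For each $n \in N$, let $h_n$ denote the history of the process up to the
$n$th decision epoch (including the state at that time). Let $\pi'$ be a
policy such that for each history $h$,

\[ P_{\pi',s'}\{ a_n = a \mid h_n = h \} = P_{\pi,s}\{ a_{n+1} = a \mid h_{n+1} = (a',s',t',h) \} . \]

Then

\[ v_{\pi'}(s') = E_{\pi,s}\Big\{ \sum_{n > 1} e^{-\alpha (t_n - t_2)} \cdot c_\alpha(s_n,a_n) \,\Big|\, a_1 = a', \ s_2 = s', \ t_2 = t' \Big\} < v_\alpha(s') , \]

which is a contradiction. Therefore

\[ v_\pi(s) \ge E_{\pi,s}\{ c_\alpha(s_1,a_1) + e^{-\alpha t_2} \cdot v_\alpha(s_2) \} = (T_\pi v_\alpha)(s) . \]

This implies that

\[ v_\pi(s) \ge \inf_{a \in A_s} \Big\{ c_\alpha(s,a) + \sum_{s' \in S} q_\alpha(s,a,s') \, v_\alpha(s') \Big\} , \]

for $s \in S$. But this holds for each $\pi$ in $\mathcal{P}$, so

\[ v_\alpha(s) \ge \inf_{a \in A_s} \Big\{ c_\alpha(s,a) + \sum_{s' \in S} q_\alpha(s,a,s') \, v_\alpha(s') \Big\} , \]

for $s \in S$. Combining this with the result above, the theorem follows.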
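
For intuition, the optimality equation can be solved numerically in the simplest possible setting. The sketch below is an assumption-laden illustration and not the method of this report: it takes finitely many states and actions, bounded costs, and hypothetical tables `c_alpha[s][a]` and `q_alpha[s][a][s2]` whose rows sum to less than one, so that successive approximation converges by a contraction argument.

```python
def bellman_update(v, c_alpha, q_alpha):
    """One application of the right-hand side of the optimality equation:
    v(s) <- min over a of { c_alpha[s][a] + sum_s2 q_alpha[s][a][s2] * v[s2] }."""
    n = len(v)
    return [
        min(c_alpha[s][a] + sum(q_alpha[s][a][s2] * v[s2] for s2 in range(n))
            for a in range(len(c_alpha[s])))
        for s in range(n)
    ]


def value_iteration(c_alpha, q_alpha, tol=1e-12, max_iter=100_000):
    """Iterate the update from the zero function until successive iterates
    differ by less than tol in every state."""
    v = [0.0] * len(c_alpha)
    for _ in range(max_iter):
        new_v = bellman_update(v, c_alpha, q_alpha)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    raise RuntimeError("value iteration did not converge")


# Hypothetical two-state, two-action data (not from the report).
# q_alpha[s][a][s2] plays the role of the discounted transition function,
# so each row sums to less than one.
c_alpha = [[4.0, 6.0],
           [1.0, 3.0]]
q_alpha = [[[0.5, 0.4], [0.1, 0.3]],
           [[0.6, 0.3], [0.2, 0.2]]]

v_star = value_iteration(c_alpha, q_alpha)
residual = max(abs(x - y)
               for x, y in zip(bellman_update(v_star, c_alpha, q_alpha), v_star))
```

With unbounded costs or infinite state spaces, which are the cases of interest here, convergence of such a scheme is not automatic and the conditions developed in this chapter become relevant.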

4. On the Existence of Stationary Optimal and Stationary $\varepsilon$-Optimal
Policies.

In this section the existence of stationary optimal and stationary
$\varepsilon$-optimal policies is investigated. It is important to distinguish between
stationary optimal policies and optimal stationary policies. While the
former policies are truly optimal, the latter ones are only optimal in
the class of stationary policies. Conditions are given for optimal
stationary policies to be stationary optimal policies.

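Theorem 6: If $\pi$ is a stationary policy such that $v_\alpha = T_\pi v_\alpha$, then
$\pi$ is optimal.

Proof: Since $\pi$ is stationary, we obtain

\[ v_\alpha = T_\pi^n v_\alpha , \]

by applying $T_\pi$ on both sides of $v_\alpha = T_\pi v_\alpha$ repeatedly. This implies
that

\[ v_\alpha = \lim_{n \to \infty} T_\pi^n v_\alpha = \lim_{n \to \infty} \{ T_\pi^n \theta + Q_\pi^n v_\alpha \} \ge \lim_{n \to \infty} T_\pi^n \theta + \liminf_{n \to \infty} Q_\pi^n v_\alpha \ge v_\pi , \]

by Lemma 4. Thus, $\pi$ is optimal.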

Corollary 7: If each $A_s$ is finite, then there is a stationary optimal
policy.

Proof: The existence of a policy $\pi$ as in the theorem is in this case
guaranteed by the optimality equation.

Corollary 8: If there is an optimal policy, then there is one which
is stationary.

Proof: Let $\pi$ be an optimal policy. From the proof of the optimality
equation, $v_\pi \ge T_\pi v_\alpha$. Since $\pi$ is optimal, we obtain $T_\pi v_\alpha \le v_\alpha$. But
$v_\alpha \le T_{\pi'} v_\alpha$ for all $\pi' \in \mathcal{P}$, so $v_\alpha = T_\pi v_\alpha$. Let $\pi''$ be the stationary
policy such that $T_{\pi''} = T_\pi$. By the theorem, $\pi''$ is optimal. Thus,
there is a stationary optimal policy.

Theorem 9: If for each $s,s' \in S$,

\[ E_{\pi,s}\Big\{ \sum_{n \in N} e^{-\alpha t_n} \cdot \delta(s_n,s') \Big\} \]

is uniformly bounded, then an optimal stationary policy is a stationary
optimal policy.

Proof: For each $s,s' \in S$, let $M(s,s')$ be an upper bound on

\[ E_{\pi,s}\Big\{ \sum_{n \in N} e^{-\alpha t_n} \cdot \delta(s_n,s') \Big\} . \]

Let $\varepsilon > 0$ be given. Let $v$ be a mapping from $S$ into $R_+$ such
that $v(s') > 0$ for $s' \in S$ and

\[ \sum_{s' \in S} M(s,s') \cdot v(s') < \infty , \]

where $s$ is an element of $S$. Let $\pi$ be a stationary policy such
that

\[ T_\pi v_\alpha \le v_\alpha + \varepsilon \cdot v . \]

Such a policy exists by the optimality equation. Applying $T_\pi$ on both
sides of this inequality repeatedly, we obtain

\[ T_\pi^n v_\alpha \le v_\alpha + \varepsilon \sum_{i < n} Q_\pi^i v , \]

for $n \in N$. Letting $n$ go to infinity in the expression above, we
obtain

\[ \liminf_{n \to \infty} T_\pi^n v_\alpha \le v_\alpha + \varepsilon \sum_{n \in N} Q_\pi^n v . \]

Now

\[ \begin{aligned}
\liminf_{n \to \infty} T_\pi^n v_\alpha &= \liminf_{n \to \infty} \{ T_\pi^n \theta + Q_\pi^n v_\alpha \} \\
&= \lim_{n \to \infty} T_\pi^n \theta + \liminf_{n \to \infty} Q_\pi^n v_\alpha \\
&\ge v_\pi ,
\end{aligned} \]

by Lemma 4. Thus

\[ v_\pi \le v_\alpha + \varepsilon \sum_{n \in N} Q_\pi^n v , \]

and in particular,

\[ v_\pi(s) \le v_\alpha(s) + \varepsilon \sum_{n \in N} (Q_\pi^n v)(s) \le v_\alpha(s) + \varepsilon \sum_{s' \in S} M(s,s') \cdot v(s') . \]

Let $\pi^*$ be an optimal stationary policy. From above,

\[ v_{\pi^*}(s) \le v_\alpha(s) + \varepsilon \sum_{s' \in S} M(s,s') \cdot v(s') . \]

But $\varepsilon > 0$ is arbitrary and $\sum_{s' \in S} M(s,s') \, v(s')$ is finite, so
$v_{\pi^*}(s) \le v_\alpha(s)$. The argument can be repeated for each $s \in S$, so $\pi^*$
must be optimal.


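Theorem 10: If for each $s' \in S$,

\[ E_{\pi,s}\Big\{ \sum_{n \in N} e^{-\alpha t_n} \cdot \delta(s_n,s') \Big\} \]

is uniformly bounded, then there are stationary $\varepsilon$-optimal policies
for all $\varepsilon > 0$.

Proof: For each $s' \in S$, let $M(s')$ be a bound on

\[ E_{\pi,s}\Big\{ \sum_{n \in N} e^{-\alpha t_n} \cdot \delta(s_n,s') \Big\} . \]

Following the proof of the previous theorem, we obtain

\[ v_\pi(s) \le v_\alpha(s) + \varepsilon \sum_{s' \in S} M(s') \cdot v(s') , \]

for some stationary policy $\pi$. Since $\varepsilon > 0$ is arbitrary and

\[ \sum_{s' \in S} M(s') \cdot v(s') < \infty , \]

the theorem follows directly.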


Corollary 11: If there is a $\beta < 1$ such that

\[ E_{\pi,s}\{ e^{-\alpha t_2} \} \le \beta , \]

for $s \in S$ and $\pi \in \mathcal{P}$, then there are stationary $\varepsilon$-optimal policies
for arbitrarily small $\varepsilon$ and every optimal stationary policy is a
stationary optimal policy.

Proof: We only need to show that the conditions of the two previous
theorems are satisfied. It is enough to show that

\[ E_{\pi,s}\Big\{ \sum_{n \in N} e^{-\alpha t_n} \Big\} \le (1-\beta)^{-1} , \]

for $s \in S$, $\pi \in \mathcal{P}$. But this follows from the proof of Theorem 1.

5.  Policy  Improvement 

* 

By  the  optimality  equation,  an  optimal  policy,  ir  , must  satisfy 
the  equation 

v „ = inf  Tv  . 
t*  iref  Tv* 

* 

A policy  7T  which  satisfies  this  equation  is  called  unimprovable. 

An  unimprovable  policy  need  not  be  optimal,  as  the  following  example 
shows. 

Example: Consider a discrete time Markov decision process with state space $\{0, 1, 2, \ldots\}$ and action space $\{0,1\}$. If the system enters state 0, the process stops. In every other state there are two permissible actions, 0 and 1. If action 0 is taken in state $i$ ($i > 0$), an immediate cost $a^i$ is incurred and the state 0 is entered. If action 1 is taken in state $i$ ($i > 0$), no immediate costs are incurred and the state $i + 1$ is entered. Let $\beta$ be a given discount factor. If $a\beta > 1$, then the policy which always chooses action 0 is unimprovable, but it is not optimal: a single use of action 1 followed by action 0 costs $\beta a^{i+1} > a^i$, whereas always choosing action 1 costs nothing.
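The example can be checked numerically. The sketch below is illustrative only: the values a = 2 and beta = 0.75 (so that a * beta > 1) and all function names are assumptions introduced here, not part of the original example.

```python
def stop_cost(i, a):
    """Cost of taking action 0 in state i: pay a**i and enter state 0."""
    return float(a) ** i


def delayed_stop_cost(i, k, a, beta):
    """Cost of taking action 1 for k steps and then action 0, starting in state i."""
    return beta ** k * stop_cost(i + k, a)


def unimprovable_at(i, a, beta):
    """True if 'always action 0' cannot be improved by one use of action 1 in state i."""
    return stop_cost(i, a) <= delayed_stop_cost(i, 1, a, beta)


def never_stop_cost(horizon):
    """Discounted cost of always taking action 1: every immediate cost is zero."""
    return sum(0.0 for _ in range(horizon))


a, beta = 2.0, 0.75  # assumed values with a * beta = 1.5 > 1
for i in range(1, 6):
    # No one-step deviation improves on stopping immediately ...
    assert unimprovable_at(i, a, beta)
    # ... yet never stopping is strictly cheaper.
    assert never_stop_cost(100) < stop_cost(i, a)
```

Any finite delay before stopping only raises the cost, which is why the policy passes the unimprovability test even though the policy that never stops has zero cost.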

If there is an $s \in S$ such that

$$v_{\pi'}(s) > \inf_{\pi\in\mathcal{P}} (T_\pi v_{\pi'})(s) ,$$

then $\pi'$ is called improvable.


Theorem 12: If $\pi' \in \mathcal{P}$ and $\pi \in \mathcal{P}$ are such that $T_\pi v_{\pi'} \le v_{\pi'}$, then $v_\pi \le T_\pi v_{\pi'}$.


Proof: Applying $T_\pi$ on both sides of $T_\pi v_{\pi'} \le v_{\pi'}$ repeatedly, we obtain

$$T_\pi v_{\pi'} \ge T_\pi^n v_{\pi'} ,$$

for $n \in N$. Letting $n$ go to infinity yields

$$T_\pi v_{\pi'} \ge \liminf_{n\to\infty} T_\pi^n v_{\pi'} = \liminf_{n\to\infty} \{T_\pi^n\theta + Q_\pi^n v_{\pi'}\} = \lim_{n\to\infty} T_\pi^n\theta + \liminf_{n\to\infty} Q_\pi^n v_{\pi'} \ge v_\pi ,$$

by Lemma 4. Thus, the theorem is proved.

This theorem may be useful for the development of a policy improvement procedure like that of Howard (1960). The problem is that one has to avoid convergence to a suboptimal solution.
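For orientation, a policy improvement loop in the spirit of Howard (1960) can be sketched for a small finite problem with unit-length transitions. Everything below (the two-state data `P`, `c`, the discount factor 0.9, the function names, and the use of repeated application of $T_\pi$ for evaluation) is an assumption made for illustration. In this finite, bounded setting the loop ends at an optimal policy; the difficulty noted above concerns the unbounded case.

```python
def q_value(s, a, v, P, c, beta):
    """One-step cost of action a in state s followed by terminal values v."""
    return c[s][a] + beta * sum(P[s][a][t] * v[t] for t in range(len(P)))


def evaluate(policy, P, c, beta, sweeps=500):
    """Approximate v_pi by applying T_pi repeatedly to the zero function."""
    v = [0.0] * len(P)
    for _ in range(sweeps):
        v = [q_value(s, policy[s], v, P, c, beta) for s in range(len(P))]
    return v


def improve(policy, v, P, c, beta, tol=1e-9):
    """Switch to another action only where it is strictly better."""
    new_policy = list(policy)
    for s in range(len(P)):
        best = q_value(s, policy[s], v, P, c, beta)
        for a in range(len(P[s])):
            q = q_value(s, a, v, P, c, beta)
            if q < best - tol:
                best, new_policy[s] = q, a
    return new_policy


def policy_iteration(P, c, beta):
    policy = [0] * len(P)
    while True:
        v = evaluate(policy, P, c, beta)
        new_policy = improve(policy, v, P, c, beta)
        if new_policy == policy:  # unimprovable: stop
            return policy, v
        policy = new_policy


# Hypothetical data: P[s][a][t] is the transition probability, c[s][a] the cost.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
c = [[1.0, 2.0], [0.0, 3.0]]
policy, v = policy_iteration(P, c, beta=0.9)
```

Starting from the policy that always uses action 0, one improvement step moves state 0 to action 1, after which no further strict improvement exists.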


6.  Necessary  and  Sufficient  Conditions  for  Optimality. 

In Section 5, it was shown that an unimprovable policy need not always be optimal. Here, necessary and sufficient conditions for a policy to be optimal are presented. If $v_\alpha$ is known, then the optimality equation can be used to find out whether a given policy is optimal or not. If $v_\alpha$ is not known in advance, the following theorems may be more useful for proving that a given policy is optimal.

Theorem 13: Let $S'$ be the set of $s$ in $S$ for which $v_\alpha(s)$ is finite. Let $\mathcal{P}^*$ be any subset of $\mathcal{P}$ such that for each $\pi$ in $\mathcal{P}$ there is a $\pi'$ in $\mathcal{P}^*$ such that $v_{\pi'} \le v_\pi$. If $\pi^*$ is an unimprovable policy such that

$$\lim_{n\to\infty} (Q_\pi^n v_{\pi^*})(s) = 0 , \quad \text{for } s \in S',\ \pi \in \mathcal{P}^* ,$$

then $\pi^*$ is optimal.


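Proof: We first prove that $v_{\pi^*} \le T_\pi^n v_{\pi^*}$ for $n \in N$ and $\pi \in \mathcal{P}$. This clearly holds for $n = 1$. Now

$$(T_\pi^n v_{\pi^*})(s) = E_{\pi,s}\Big\{\sum_{i\le n} e^{-\alpha t_i}\, c_\alpha(s_i,a_i) + e^{-\alpha t_{n+1}}\, v_{\pi^*}(s_{n+1})\Big\}$$

$$= E_{\pi,s}\Big\{\sum_{i<n} e^{-\alpha t_i}\, c_\alpha(s_i,a_i) + e^{-\alpha t_n}\, E_{\pi,s}\{c_\alpha(s_n,a_n) + e^{-\alpha(t_{n+1}-t_n)}\, v_{\pi^*}(s_{n+1}) \mid t_n, s_n\}\Big\} .$$

Furthermore,

$$E_{\pi,s}\{c_\alpha(s_n,a_n) + e^{-\alpha(t_{n+1}-t_n)}\, v_{\pi^*}(s_{n+1}) \mid t_n, s_n\} \ge v_{\pi^*}(s_n) ,$$

since $\pi^*$ is unimprovable. Therefore

$$(T_\pi^n v_{\pi^*})(s) \ge E_{\pi,s}\Big\{\sum_{i<n} e^{-\alpha t_i}\, c_\alpha(s_i,a_i) + e^{-\alpha t_n}\, v_{\pi^*}(s_n)\Big\}$$

$$\ge E_{\pi,s}\Big\{\sum_{i<n-1} e^{-\alpha t_i}\, c_\alpha(s_i,a_i) + e^{-\alpha t_{n-1}}\, v_{\pi^*}(s_{n-1})\Big\} \ge v_{\pi^*}(s) ,$$

for $s \in S$. This implies that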
$$\lim_{n\to\infty} \big| E_{\pi,s}\{e^{-\alpha t_n}\, v_{\pi^*}(s_n)\} \big| \le \lim_{n\to\infty} E_{\pi,s}\{e^{-\alpha t_n}\, |v_{\pi^*}(s_n)|\} \le M \cdot \lim_{n\to\infty} E_{\pi,s}\{e^{-\alpha t_n}\} = 0 ,$$

by Assumption 3. The corollary now follows from the theorem.


Theorem 16: Suppose that there is an optimal policy, $\pi$. Then a policy $\pi^*$ is optimal if and only if

$$\lim_{n\to\infty} (Q_\pi^n v_{\pi^*})(s) = 0 , \quad \text{for } s \in S' .$$


Proof: The if part of the theorem follows from Theorem 13 by letting $\mathcal{P}^* = \{\pi\}$. The only if part is proven as follows. Suppose that $\pi^*$ is optimal. Then

$$(Q_\pi^n v_{\pi^*})(s) = (Q_\pi^n v_\pi)(s) = v_\pi(s) - (T_\pi^n\theta)(s) ,$$

for $s \in S'$, $n \in N$. This implies that

$$\lim_{n\to\infty} (Q_\pi^n v_{\pi^*})(s) = \lim_{n\to\infty} \big(v_\pi(s) - (T_\pi^n\theta)(s)\big) = 0 ,$$

for $s \in S'$. This completes the proof of the theorem.

7.  Norms  and  Contraction  Mappings.

It may sometimes be more convenient to work with norms and contraction mappings. Denardo (1967) did this, and developed an elegant analysis. Recently, Lippman (1975) used these concepts.

As before, let $\mathbf{1}$ be the function from $S$ into $R$ with value 1 everywhere. Let $\|\cdot\|$ be a norm on $B$ such that

(a) $\|\mathbf{1}\| = 1$,

(b) $\|u\| \le \|v\|$ if $0 \le u \le v$.

The sup norm, given by

$$\|v\| = \sup_{s\in S} |v(s)| ,$$

is such a norm. Lippman (1975) has considered other norms.

A mapping $T$ from $B$ into $B$ is called a contraction mapping if there is a $\beta < 1$ such that

$$\|Tv\| \le \beta\|v\| ,$$

for $v \in B$. Denardo's n-stage contraction condition is as follows. There is an $n \in N$ and a $\beta < 1$ such that

$$\|Q_\pi^n v\| \le \beta\|v\| ,$$

for $v \in B$ and $\pi \in \mathcal{P}$. We weaken the n-stage contraction condition so that it reads as follows. For each $v \ge 0$, there is an $n \in N$ and a $\beta < 1$ such that

$$\|Q_\pi^n v\| \le \beta\|v\| ,$$

for all $\pi$ in $\mathcal{P}$.

Lemma 17: If there is an $n \in N$ and a $\beta < 1$ such that

$$E_{\pi,s}\{e^{-\alpha t_n}\} \le \beta$$

for $s \in S$ and $\pi \in \mathcal{P}$, then the sup norm satisfies the n-stage contraction condition.


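Proof: We have

$$\|Q_\pi^n v\| = \sup_{s\in S} |(Q_\pi^n v)(s)| = \sup_{s\in S} \big|E_{\pi,s}\{e^{-\alpha t_n}\, v(s_n)\}\big|$$

$$\le \sup_{s\in S} E_{\pi,s}\Big\{e^{-\alpha t_n} \cdot \sup_{s'\in S} |v(s')|\Big\} = \|v\| \cdot \sup_{s\in S} E_{\pi,s}\{e^{-\alpha t_n}\} ,$$

and the lemma follows.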
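The bound in the proof can be illustrated on a small chain. The sketch assumes a discrete-time process in which every transition has the same deterministic discount factor `beta`, so that one application of the discounted transition operator multiplies the expected value by `beta`; the matrix `P`, the function `v` and the names used are hypothetical.

```python
def apply_Q(P, v, beta):
    """(Qv)(s) = beta * sum over s' of P(s, s') * v(s'), assuming unit-length transitions."""
    return [beta * sum(p * x for p, x in zip(row, v)) for row in P]


def sup_norm(v):
    """Sup norm of a function on a finite state space."""
    return max(abs(x) for x in v)


P = [[0.5, 0.5],
     [0.1, 0.9]]          # hypothetical transition matrix
v = [3.0, -7.0]           # hypothetical bounded function
beta = 0.8                # assumed one-stage discount factor

Qv = apply_Q(P, v, beta)
# The contraction inequality ||Qv|| <= beta * ||v|| from the lemma.
assert sup_norm(Qv) <= beta * sup_norm(v)
```

Because each row of `P` sums to one, the expectation of `v` never exceeds its sup norm, which is the step used in the proof.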

Let $\rho(\cdot,\cdot)$ be a metric on $B \times B$ such that for $u, v$ in $B$, $\rho(u,v) = \|w\|$, where

$$w(s) = \begin{cases} u(s) - v(s), & \text{if } u(s) < \infty \text{ or } v(s) < \infty , \\ 0, & \text{if } u(s) = v(s) = \infty . \end{cases}$$

Theorem 18: If $\|\cdot\|$ satisfies the n-stage contraction condition, then a policy $\pi^*$ is optimal if and only if $\pi^*$ is unimprovable and $\rho(v_{\pi^*}, v_\alpha) < \infty$.

Proof: The only if part of the theorem is trivial. We now prove the if part. Let $w$ be such that

$$w(s) = \begin{cases} v_{\pi^*}(s) - v_\alpha(s), & \text{if } v_\alpha(s) < \infty \text{ or } v_{\pi^*}(s) < \infty , \\ 0, & \text{if } v_{\pi^*}(s) = v_\alpha(s) = \infty . \end{cases}$$

Let $n \in N$ and $\beta < 1$ be as in the contraction condition. Let $\varepsilon > 0$ be given, and let $\pi$ be a stationary policy such that

$$v_\pi \le v_\alpha + (\varepsilon/n) \cdot \mathbf{1} .$$

$\pi^*$ is unimprovable, so $v_{\pi^*} \le T_\pi v_{\pi^*}$. This implies that

$$w \le Q_\pi w + (\varepsilon/n) \cdot \mathbf{1} ,$$

since $w \ge 0$. Applying $Q_\pi$ to both sides of this inequality repeatedly yields

$$w \le Q_\pi^n w + (\varepsilon/n) \sum_{i<n} Q_\pi^i \mathbf{1} \le Q_\pi^n w + \varepsilon \cdot \mathbf{1} ,$$

since $Q_\pi \mathbf{1} \le \mathbf{1}$. Taking the norm of the functions on both sides of the inequality yields

$$\|w\| \le \|Q_\pi^n w + \varepsilon \cdot \mathbf{1}\| \le \|Q_\pi^n w\| + \varepsilon\|\mathbf{1}\| \le \beta\|w\| + \varepsilon ,$$

by the contraction condition. But $\varepsilon > 0$ is arbitrary, $\beta < 1$ and $\|w\| < \infty$. This implies that $\|w\| = 0$, and thus $v_{\pi^*} = v_\alpha$.

Corollary 19: If $\pi^*$ is an unimprovable policy such that $v_{\pi^*}(s) - v_\alpha(s)$ is bounded, and if the condition of Lemma 17 is satisfied, then $\pi^*$ is optimal.

Notice  the  similarity  of  this  corollary  and  Corollary  15. 

Corollary 20: If $\|\cdot\|$ satisfies the n-stage contraction condition, if $v_\alpha$ is bounded below, and if $\pi^*$ is an unimprovable policy such that $\|v_{\pi^*}\| < \infty$, then $\pi^*$ is optimal.
CHAPTER  4


OPTIMAL  CONTROL  OF  QUEUEING  SYSTEMS 

There has been considerable interest in the control of queueing systems in the last decade. Often the control problems have been formulated in the framework of semi-Markov decision processes. The existence of certain simple and intuitive optimal policies has been proven for many different queueing systems. For a brief (but excellent) survey of
the  literature  in  this  area,  see  Gross  and  Harris  (1974,  pp.  3^4^380>)* 

In  this  chapter,  three  aspects  of  the  control  of  queueing  systems 
are  considered.  In  Section  1,  the  formulation  of  queueing  control  problems 
is  discussed.  Section  2 elaborates  upon  two  general  approaches  to  the 
solution  of  queueing  control  problems.  In  Section  3,  four  different
methods  for  proving  the  optimality  of  an  unimprovable  policy  are  developed. 

1.  Formulation  of  Queueing  Control  Problems. 

The  formulation  of  queueing  control  problems  plays  an  important 
role  in  the  solution  of  these  problems.  Sometimes,  a queueing  control 
problem  may  be  formulated  in  two  different  but  equivalent  ways,  where 
only  one  is  amenable  to  analysis.  Special  queueing  control  problems 
may  have  special  desirable  formulations.  But  since  a general  formulation 
of  queueing  control  problems  may  yield  a better  perspective,  we  shall 
now  briefly  describe  the  various  components  of  a controllable  queueing 
system. 

A queueing  system  consists  of  an  input  source,  a queue  and  a service 
mechanism.  The  input  source  generates  customers  which  need  certain  services 
provided  by  the  service  mechanism.  A customer  generated  by  the  input
source  is  said  to  arrive  at  the  queueing  system.  The  times  between  two 
consecutive  arrivals  are  the  interarrival  times.  On  arrival,  a customer 
either  is  given  service  immediately  or  is  placed  in  the  queue  of  customers 
waiting  to  be  served.  There  may  be  several  customer  classes,  reflecting 
the  special  needs  of  the  customers.  The  service  mechanism  may  consist 
of  one  or  several  service  facilities,  each  of  which  has  a certain  number 
of  servers.  When  the  customers  have  received  their  service(s),  they
leave  the  system. 

The  control  of  queueing  systems  can  take  various  forms.  Sometimes, 
the  arrival  rate  may  be  adjusted  dynamically.  Other  times,  the  service 
rate(s)  or  the  number  of  active  servers  may  be  controlled.  A third
possibility  is  to  control  the  order  in  which  the  customers  are  given 
service. 

There  are  various  costs  that  may  need  to  be  considered  when  analyzing 
queueing  systems.  For  example,  there  may  be  a service  cost  which  is
incurred  each  time  a customer  is  served.  If  the  server(s)  can  be  turned
on  and  off,  there  may  be  start-up  and  shut-down  costs  when  the  server(s)
are  turned  on  and  turned  off,  respectively.  There  may  be  an  idling  cost 
which  is  incurred  at  a positive  and  constant  rate,  for  each  server  when 
he  is  not  giving  service  or  performing  other  useful  duties.  There  may 
be  a customer  holding  cost  which  is  incurred  at  a rate  which  is  a function 
of  the  number  of  customers  in  the  system. 

There  may,  of  course,  be  many  other  types  of  controls  and  costs 
than  those  which  have  been  mentioned  here.  But  surprisingly  many  of  the 
queueing  control  problems  which  have  been  considered  in  the  literature 
fit  the  above  description. 
By formulating a queueing control problem as a semi-Markov decision
process, the theory for such processes may be used to develop a solution
procedure or to prove that a given policy is optimal (or not). The
formulation  is  usually  quite  straightforward.  One  only  has  to  define 
the  state  of  the  system  and  the  decision  epochs.  The  state  space,  the 
set  of  action  spaces,  the  law  of  motion  and  the  cost  function  of  the 
semi-Markov  decision  process  are  then  determined  by  the  specification 
of  the  queueing  system. 

The  definition  of  the  state  of  the  system  is  crucial.  The  state 
must  characterize  the  queueing  system  completely  at  each  decision  epoch. 
Since  a queueing  system  consists  of  an  input  source,  a queue  and  a service 
mechanism,  one  may  define  the  state  of  the  input  source,  the  state  of 
the  queue  and  the  state  of  the  service  mechanism.  The  state  of  the 
system  is  then  given  by  these  three  states.  The  state  space  of  the 
system  may  be  defined  as  the  Cartesian  product  of  the  state  spaces  of 
the  input  source,  the  queue  and  the  service  mechanism,  respectively.
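As an illustration of this product structure, the sketch below enumerates a truncated state space. The field names, the truncation level and the encoding of each component are assumptions chosen for the example; the report itself does not prescribe a representation.

```python
from itertools import product
from typing import NamedTuple


class SystemState(NamedTuple):
    source: int      # hypothetical: phase of the input source
    queue: tuple     # number of customers waiting in each class
    service: tuple   # on/off indicator (1/0) for each server


def truncated_state_space(source_states, max_per_class, num_classes, num_servers):
    """Cartesian product of the three component state spaces, with the
    queue truncated at max_per_class customers per class."""
    queue_states = list(product(range(max_per_class + 1), repeat=num_classes))
    service_states = list(product((0, 1), repeat=num_servers))
    return [SystemState(x, q, m)
            for x, q, m in product(source_states, queue_states, service_states)]


states = truncated_state_space(source_states=[0], max_per_class=2,
                               num_classes=2, num_servers=1)
```

The truncation is only there to make the enumeration finite; without it the queue component ranges over all vectors of non-negative integers, which is still a countable set.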

The  state  space  of  a queueing  system  is  often  countable.  If  the 
input  source,  the  queue  and  the  service  mechanism  all  have  countable 
state  spaces,  then  the  state  space  of  the  system  is  countable.

Consider  the  state  space  of  the  queue.  Suppose  that  there  is  a 
countable number of customer classes. If the state of the queue is
defined as the vector whose i-th component indicates the number of customers
in class i (for each i ∈ I), then the state space of the queue
is  countable.  This  follows  from  the  fact  that  there  are  only  a finite 
number  of  customers  in  the  queue  at  any  given  time. 

Consider the state space of the service mechanism. One case is the system which can be controlled by turning servers on or off. For this case, if there is a countable number of servers, and if the state of the service mechanism is defined as the vector whose i-th component indicates whether the i-th server is on or off (for each i ∈ N), then the state space of the service mechanism is countable. For a more general case, suppose now that the service rate of each server may be adjusted to a countable number of levels. Also suppose that there are a countable number of servers and that the service rate is only non-zero for a finite number of servers at any given point in time. If the state of the service mechanism is defined as the vector whose i-th component indicates the level of the service rate of the i-th server, then the state space is still countable.

The  definition  of  the  decision  epochs  is  also  crucial.  As  mentioned 
before,  the  state  of  the  system  must  characterize  the  queueing  system 
completely  at  each  decision  epoch.  The  most  natural  way  to  define  the 
decision  epochs  is  by  letting  them  be  the  epochs  when  the  state  of  the 
system  changes.  If  the  state  of  the  system  (as  it  happens  to  be  defined) 
does  not  characterize  the  queueing  system  completely  at  each  of  these 
decision  epochs,  one  can  try  to  eliminate  some  of  the  decision  epochs. 

Sometimes  it  may  be  desirable  to  have  the  decision  epochs  equally 
spaced  in  time.  In  this  case,  the  decision  epochs  are  determined  by 
specifying  the  length  of  time  between  two  consecutive  decision  epochs. 
Magazine  (1971)  used  this  approach.  Other  times,  it  may  be  desirable 
to define the decision epochs such that the times between two consecutive
decision epochs are independent and identically distributed random
variables.  Lippman  (1975)  used  this  approach.  Both  of  these  ways  of 
defining  the  decision  epochs  are  motivated  by  a certain  solution  method 
which  will  be  elaborated  upon  in  the  next  section. 
2.  Analytical  Solution  Methods.

A large  variety  of  queueing  control  problems  have  been  successfully 
analyzed  by  a number  of  investigators.  Their  successes  have  to  some 
extent  depended  on  the  special  features  of  the  problems  they  considered. 

But  many  of  the  queueing  problems  also  have  much  in  common.  Therefore, 
there  is  some  basis  for  developing  general  approaches  for  solving  them. 

Prabhu  and  Stidham  (1973)  attempted  to  develop  a unified  view  of  the 
different  approaches  that  have  been  used  previously. 

If  the  state  and  action  spaces  are  finite,  then  there  are  well- 
known  (policy  improvement,  policy  iteration)  algorithms  for  finding  an 
optimal  policy.  But  in  the  context  of  queueing  systems,  one  is  often 
more  interested  in  showing  that  there  is  an  optimal  policy  of  a simple 
and  intuitive  form.  As  a by-product  of  this,  one  may  perhaps  develop 
especially  efficient  algorithms  for  finding  an  optimal  policy.  Two 
general  approaches  for  analyzing  queueing  systems  will  now  be  presented.

The first approach consists of solving the problem for one period (stage) and then extending the results to arbitrarily many periods by an inductive argument. This approach was initially used for solving inventory problems (e.g. by Iglehart (1963)). Because of the similarity between queueing and inventory problems, the approach was later adopted by queueing theoreticians. McGill (1969) used the approach in his analyses of the M/M/c queueing system with controllable servers. A full development of this approach can be found in Porteus (1975b).
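The inductive step can be pictured as backward induction on the number of remaining periods. The sketch below uses a hypothetical two-state, two-action problem with unit-length periods and an assumed discount factor of 0.9; it is not one of the queueing models cited above, and the names are introduced here for illustration only.

```python
def backward_induction(P, c, beta, periods):
    """Return the optimal k-period values for k = periods, together with the
    minimizing first-period actions found for each horizon 1..periods."""
    n = len(P)
    v = [0.0] * n              # zero terminal cost
    policies = []
    for _ in range(periods):
        q = [[c[s][a] + beta * sum(P[s][a][t] * v[t] for t in range(n))
              for a in range(len(P[s]))]
             for s in range(n)]
        policies.append([row.index(min(row)) for row in q])
        v = [min(row) for row in q]   # one more period added in front
    return v, policies


# Hypothetical data: P[s][a][t] is the transition probability, c[s][a] the cost.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
c = [[1.0, 2.0], [0.0, 3.0]]

v1, _ = backward_induction(P, c, 0.9, periods=1)
v2, _ = backward_induction(P, c, 0.9, periods=2)
v10, policies = backward_induction(P, c, 0.9, periods=10)
```

In this toy problem the one-period solution prefers action 0 in state 0, while the solution for three or more periods prefers action 1, so structural results proved for one period need the inductive argument before they can be carried over to longer horizons.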

This approach has two advantages. First, the one-period problem is
usually easier to analyze than the infinite period problem. Second, a successful
analysis solves both the finite and infinite horizon problems.
However,  this  approach  of  first  solving  the  one-period  problem  can
also  have  its  disadvantages.  In  fact,  for  many  queueing  problems,  the 
one-period  problem  is  rather  meaningless.  One  reason  is  that  the  length 
of  the  first  period  may  not  be  nearly  the  same  for  different  start-states 
and  different  actions.  Furthermore,  many  important  costs  may  be  neglected 
in  the  one-period  problem  (e.g.,  switching  costs).  Nevertheless,  the
approach  is  still  attractive  for  many  problems. 

The  second  approach  consists  of  restricting  one's  search  for  an
optimal  policy  to  a small  class  of  stationary  policies  (hopefully  not 
excluding  the  optimal  policy)  and  then  proving  that  the  policy  which  is 
optimal  in  this  class  is  also  optimal  among  all  policies.  To  prove  that 
a policy  believed  to  be  optimal  is  indeed  optimal  among  all  policies, 
one  usually  only  has  to  prove  that  the  policy  is  unimprovable.  This 
approach  has  been  used  by,  among  others,  Reed  (1974a),  (1974b). 

This  approach  has  the  advantage  that  it  usually  only  requires  the 
analysis  of  relatively  simple  stationary  policies.  If  one  can  obtain 
an  explicit  expression  for  the  value  functions  of  these  policies,  then 
it is usually a simple matter to prove when one of these policies is
unimprovable (and thus probably optimal). Even if such explicit results
cannot  be  obtained,  the  approach  may  still  be  used  with  success  (e.g., 
see  Orhenyi  (1976)). 

The  disadvantage  of  the  approach  lies  in  the  fact  that  an  unimprovable 
policy  need  not  necessarily  be  optimal.  In  the  previous  chapters,  several 
conditions  for  an  unimprovable  policy  to  be  optimal  were  given.  For 
example,  when  discounting  is  used,  it  was  shown  that  if  the  value  function 
of  the  unimprovable  policy  is  bounded,  then  the  policy  is  optimal. 




But  queueing  control  problems  are  often  characterized  by  giving 
rise  to  unbounded  value  functions.  This  is  often  due  to  the  holding 
costs  being  unbounded.  In  the  next  section,  it  is  shown  how  this
problem  can  be  solved. 


3.  Solutions  to  the  Problem  of  Unbounded  Costs.


We  now  consider  the  problem  of  unbounded  costs  with  discounting, 
and  develop  four  different  methods  for  proving  that  an  unimprovable  policy 
is  optimal.  The  assumptions  of  Chapter  3 are  retained  here.




3.1  A Reformulation. 

Perhaps  the  easiest  way  to  solve  the  problem  of  unbounded  costs 
is  by  reformulating  the  cost  structure  of  the  system  under  consideration 
in such a way that the costs become bounded. There is, however, no
single recipe for doing this. Different problems may require different
reformulations.  Here,  an  idea  of  Bell  (1971)  is  generalized. 

For the sake of simplicity, suppose that the expected discounted
cost excluding the cost due to holding customers in the system is bounded.
Also  suppose  that  there  are  m customer  classes  and  that  a holding  cost 
is  incurred  at  a rate  which  is  a given  function,  h,  of  the  number 
of  customers  present  in  each  customer  class.  Define  the  state  of  the 


queue  as  indicated  in  Section  1. 


For each $n \in N$, let $t_n$ denote the time of the $n$-th change in the state of the queue and let $y_n$ denote the state of the queue immediately after the change. Without loss of generality, assume that $t_1 = 0$. For each policy $\pi$ and state $s$, let $v_\pi^h(s)$ denote the expected discounted holding cost, given that the policy $\pi$ is used and that the start state is $s$. Clearly

$$v_\pi^h(s) = E_{\pi,s}\Big\{\sum_{n\in N} \int_{t_n}^{t_{n+1}} h(y_n)\, e^{-\alpha t}\, dt\Big\} = \frac{1}{\alpha} h(y_1) + E_{\pi,s}\Big\{\sum_{n>1} \frac{1}{\alpha}\big(h(y_n) - h(y_{n-1})\big)\, e^{-\alpha t_n}\Big\} ,$$

for each $s \in S$ and $\pi \in \mathcal{P}$.

Now, reformulate the holding cost structure such that at each time $t_n$ ($n > 1$), the holding cost

$$x_n = \frac{1}{\alpha}\big(h(y_n) - h(y_{n-1})\big)$$

is incurred. Formally, we choose to include the cost $x_n$ in the costs incurred in the period from $t_{n-1}$ to $t_n$ ($n > 1$). For each start-state $s$ and policy $\pi$ used, the expected discounted holding cost becomes

$$v_\pi^h(s) - \frac{1}{\alpha} h(y_1) .$$

The subtracted term depends only on the start-state and not on the policy. Thus, the problem before the reformulation is equivalent to the problem after the reformulation with regard to optimal policies.
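The identity behind the reformulation can be checked on a single sample path. The sketch below is an illustration under stated assumptions: the path (`times`, `states`), the cost function h(y) = y^2, the rate alpha = 0.1 and the function names are all hypothetical, and the last state on the path is assumed to persist forever so that the infinite sum is finite.

```python
import math


def integrated_holding_cost(times, states, h, alpha):
    """Discounted holding cost incurred continuously at rate h(y_n) on [t_n, t_{n+1})."""
    total = 0.0
    for n, (t, y) in enumerate(zip(times, states)):
        if n + 1 < len(times):
            t_next = times[n + 1]
            total += h(y) * (math.exp(-alpha * t) - math.exp(-alpha * t_next)) / alpha
        else:
            # Assumption: the final state is held forever.
            total += h(y) * math.exp(-alpha * t) / alpha
    return total


def lump_sum_holding_cost(times, states, h, alpha):
    """h(y_1)/alpha plus the lump sums x_n = (h(y_n) - h(y_{n-1}))/alpha paid at t_n.
    Requires times[0] == 0."""
    total = h(states[0]) / alpha
    for n in range(1, len(times)):
        x_n = (h(states[n]) - h(states[n - 1])) / alpha
        total += x_n * math.exp(-alpha * times[n])
    return total


times = [0.0, 0.5, 1.3, 2.0]     # hypothetical change epochs, first one at 0
states = [2, 3, 2, 1]            # hypothetical queue lengths after each change
h = lambda y: float(y * y)       # an unbounded holding cost rate
alpha = 0.1

a = integrated_holding_cost(times, states, h, alpha)
b = lump_sum_holding_cost(times, states, h, alpha)
assert abs(a - b) < 1e-9
```

The two totals agree path by path, so they also agree in expectation; dropping the policy-independent term h(y_1)/alpha then leaves only the lump sums, whose size is governed by the increments of h rather than by h itself.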

Assume that the number of customers in each customer class only can change by one at a time and that changes in different customer classes cannot occur simultaneously. Let $Y$ denote the state space of the queue, and for each $i$ ($\le m$), let $u_i$ denote the $m$-vector whose components are all zero except for the $i$-th one which is equal to one. We can now state the following theorem.


Theorem 1: If for each policy $\pi$,

$$E_{\pi,s}\Big\{\sum_{n\in N} e^{-\alpha t_n}\Big\}$$

is uniformly bounded, and if there is an $M < \infty$ such that

$$|h(y + u_i) - h(y)| \le M$$

for $1 \le i \le m$ and $y \in Y$, then every unimprovable policy is optimal.

Proof:  Under  the  conditions  of  the  theorem,  the  expected  discounted 

holding  cost  after  the  reformulation  is  bounded.  Therefore,  any  policy 
which  is  unimprovable  for  the  problem  after  the  reformulation  is  optimal 
for  that  problem.  But  the  optimal  policies  are  the  same  for  both  problems. 
The  unimprovable  policies  are  also  the  same  for  both  problems.  Therefore, 
we  conclude  that  a policy  which  is  unimprovable  for  the  original  problem 
is  also  optimal. 




Example (The M/G/1 queueing system with removable server):

Excluding the policies which turn the server on and off repeatedly at a decision epoch, the expected discounted cost excluding those due to holding customers in the system is bounded. Let $\lambda$ be the arrival rate of the customers, and let $\omega$ ($< 1$) be the Laplace transform of the service times (with its parameter being equal to the interest rate $\alpha$). Let $(t'_n)_{n\in N}$ be the sequence of times when customers arrive, and let $(t''_n)_{n\in N}$ be the sequence of times when customers depart. It can easily be shown that for each policy $\pi$ used and each start-state $s$,


and 


> 


-at" 

7 neN 


< ~ < co 
- 1-0) 
Since $(t_n)_{n\in N}$ is a subsequence of $(t'_n)_{n\in N} \cup (t''_n)_{n\in N}$, the first condition of the theorem holds.

If  the  slope  of  h is  bounded  (in  this  case  h is  a function  of 
one  variable),  then  the  second  condition  of  the  theorem  holds.  Thus, 
if the slope of the holding cost function is bounded, then every
unimprovable policy is optimal. This is just the assumption made by Blackburn
(1971)  when  he  considered  the  convex  holding  cost  model. 

3.2  Comparison  with  the  Policy  which  Shuts  Down  the  System. 

Assume as before that the customer holding cost is incurred at a rate $h(y_n)$ in each interval $(t_n, t_{n+1})$. Also assume that $h$ is such that

$$0 \le h(x) \le h(y)$$

for $x \le y$ and $x \in Y$, $y \in Y$.

Assume that the system can be shut down at any decision epoch and that the shut-down cost is bounded uniformly from above. Let $\pi_0$ denote the policy which always shuts the system down (or leaves it off). Assume that when the policy $\pi_0$ is used the total number of customers present in each customer class is at a maximum at all times for any given start-state.

Theorem 2: If $\pi^*$ is an unimprovable policy such that, for each $s \in S$,

$$v_{\pi^*}(s) \le v_{\pi_0}(s) < \infty ,$$

then $\pi^*$ is optimal.

Proof: By Theorem 13 in Chapter 3, we only need to show that

$$\lim_{n\to\infty} E_{\pi,s}\{e^{-\alpha t_n} \cdot v_{\pi^*}(s_n)\} = 0$$

for each $s \in S$ and $\pi \in \mathcal{P}$. Here $(t_n)_{n\in N}$ is the sequence of the times of the decision epochs.

For each $s \in S$, let $R(s)$ denote the expected discounted shut-down cost when the system is in state $s$ and the policy $\pi_0$ is used. For each $\pi \in \mathcal{P}$, $s \in S$ and $t \in R$, let $x_\pi(s,t)$ denote the discounted holding cost incurred from time $t$ onward (the discounting starting at time 0), given that the start-state is $s$ and that the policy $\pi$ is used. It follows from our assumptions that

$$x_\pi(s,t) \le x_{\pi_0}(s,t) , \quad \text{for } t \in R,\ s \in S,\ \pi \in \mathcal{P} .$$

Now

$$v_{\pi_0}(s) = R(s) + E\{x_{\pi_0}(s,0)\} , \quad \text{for } s \in S ,$$

so

$$E\{x_{\pi_0}(s,0)\} < \infty , \quad \text{for } s \in S .$$

For each $\pi \in \mathcal{P}$ and $s \in S$, let $(t_n(\pi,s))_{n\in N}$ be the sequence of the times of the decision epochs, given that the start-state is $s$ and that the policy $\pi$ is used.

Choose a $\pi \in \mathcal{P}$, and for each $n \in N$, let $\pi_n$ be the policy which follows $\pi$ until the $n$-th decision epoch and then shuts down the system. Then

-at  . 

- E(\<s’V,r’s))! 

= E(1(tn(ir,s)  < t)  • 
But

$$\lim_{t\to\infty} E\{x_{\pi_0}(s,t)\} = 0 , \quad \text{for } s \in S ,$$

since

$$E\{x_{\pi_0}(s,0)\} = \lim_{t\to\infty} E\Big\{\int_0^t e^{-\alpha\tau} \cdot h(y_\tau)\, d\tau\Big\} < \infty , \quad \text{for } s \in S ,$$

where $y_t$ denotes the state of the queue at time $t$. Therefore


-at 

lim  E_  (e  n • v (a  } } 
",  s 


n ->  » 


V n* 


-at  -at 

n.„/_  m . n . h 


- llm  \,Sle  X*  STT,  ste 

n n -»  00  ’ 0 


-at 


< lim  E (e  n • H(s  ) 3 , 

~ n _>  00  11  >s  n 


for  s e S.  Let  M be  a finite  upper  bound  on  R(s).  Then 


-at  -at 

iim  E {e  n • v (.„))  <M  * lim  E L (e  n} 
n -»■ 00  0 n ->  00  3 


= 0,  for  s e S 


since

$$\lim_{n\to\infty} P_{\pi,s}\{t_n \le t\} = 0 , \quad \text{for } t \in R,\ s \in S .$$

This completes the proof.

Q.E.D.


Example (The M/G/1 queueing system with removable server):

The state of the system is defined as a pair of integers $(i,j)$, where $i$ denotes the number of customers in the system and $j = 0$ if the server is off, $j = 1$ if the server is on.


It  is  easy  to  find  that 


for  3 = 


0,ieNQ, 


v_  (i,o)  -< 

"o 


R2  + D * h(i+kA  for  5 = 1)  i e Nc 


Therefore, if $\pi^*$ is an unimprovable policy such that

$$v_{\pi^*}(i,j) \le v_{\pi_0}(i,j) , \quad \text{for } j \in \{0,1\},\ i \in N_0 ,$$


and  if 


then $\pi^*$ is optimal.


3.3 Comparison with the Policy which Minimizes the Expected Discounted Holding Cost.

Suppose that there is a policy which minimizes the expected discounted holding cost, and let $\pi_0$ denote such a policy. For each $\pi \in \mathcal{P}$ and $s \in S$, let $v_\pi^{nh}(s)$ denote the expected discounted cost excluding the holding costs, given that the start-state is $s$ and that the policy $\pi$ is used. Then

$$v_\pi(s) = v_\pi^h(s) + v_\pi^{nh}(s) , \quad \text{for } s \in S,\ \pi \in \mathcal{P} .$$

Let $\rho$ be a metric defined as in Chapter 3. Let $\wedge$ be the binary operator such that

$$x \wedge y = \min(x,y) , \quad \text{for } x \in R,\ y \in R .$$


We  are  now  ready  to  state  the  following  theorem. 


Theorem 3: If $\pi^*$ is an unimprovable policy such that $v_{\pi^*} \le v_{\pi_0}$ and, in addition,

$$\rho(v_{\pi_0}^{nh},\ v_{\pi_0}^{nh} \wedge v_\pi^{nh}) < \infty , \quad \text{for } \pi \in \mathcal{P} ,$$

then $\pi^*$ is optimal.


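Proof: By Theorem 18 of Chapter 3, we only have to show that

$$\rho(v_{\pi^*},\ v_{\pi^*} \wedge v_\pi) < \infty , \quad \text{for } \pi \in \mathcal{P} .$$

But

$$\rho(v_{\pi^*},\ v_{\pi^*} \wedge v_\pi) \le \rho(v_{\pi_0},\ v_{\pi_0} \wedge v_\pi)$$

$$= \rho\big((v_{\pi_0}^h + v_{\pi_0}^{nh}),\ (v_{\pi_0}^h + v_{\pi_0}^{nh}) \wedge (v_\pi^h + v_\pi^{nh})\big)$$

$$\le \rho\big((v_{\pi_0}^h + v_{\pi_0}^{nh}),\ (v_{\pi_0}^h + v_{\pi_0}^{nh}) \wedge (v_{\pi_0}^h + v_\pi^{nh})\big)$$

$$\le \rho(v_{\pi_0}^{nh},\ v_{\pi_0}^{nh} \wedge v_\pi^{nh}) < \infty , \quad \text{for } \pi \in \mathcal{P} .$$

Q.E.D.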


Example (The M/G/1 queueing system with removable server):

In this case, let $\|\cdot\|$ be the sup norm. Excluding those policies which turn the server on and off repeatedly at a decision epoch, the contraction condition of Section 7 of Chapter 3 is satisfied. For
any  v e


< « . 


We conclude that if $\pi^*$ is an unimprovable policy such that $v_{\pi^*} \le v_{\pi_0}$, then $\pi^*$ is optimal.


3.4 Comparison with a Policy which Minimizes the Expected Discounted Holding Cost until a Finite Set of States is Reached.

We now generalize the result of Section 3.3. This time, let $\pi_0$ denote a policy which minimizes the expected discounted holding cost incurred until a given, finite set of states is reached. Assume that $v_{\pi_0}^h$ is finite-valued. Let $\rho$ be defined as before.

Theorem 4: If $\pi^*$ is an unimprovable policy such that $v_{\pi^*} \le v_{\pi_0}$ and, in addition,

$$\rho(v_{\pi_0}^{nh},\ v_{\pi_0}^{nh} \wedge v_\pi^{nh}) < \infty , \quad \text{for } \pi \in \mathcal{P} ,$$

then $\pi^*$ is optimal.


Proof: Since $\pi_0$ minimizes the expected discounted holding cost incurred until a given, finite set of states is reached, and since $v_{\pi_0}^h$ is finite-valued, there must exist an $M < \infty$ such that

$$v_{\pi_0}^h(s) \le v_\pi^h(s) + M , \quad \text{for } s \in S,\ \pi \in \mathcal{P} .$$

Now

$$\rho(v_{\pi^*},\ v_{\pi^*} \wedge v_\pi) \le \rho(v_{\pi_0},\ v_{\pi_0} \wedge v_\pi)$$

$$= \rho\big(v_{\pi_0},\ v_{\pi_0} \wedge (v_\pi^h + v_\pi^{nh})\big)$$

$$\le \rho\big(v_{\pi_0},\ v_{\pi_0} \wedge (v_{\pi_0}^h - M + v_\pi^{nh})\big)$$

$$\le \rho\big(v_{\pi_0}^{nh},\ v_{\pi_0}^{nh} \wedge (v_\pi^{nh} - M)\big)$$

$$\le \rho(v_{\pi_0}^{nh},\ v_{\pi_0}^{nh} \wedge v_\pi^{nh}) + M < \infty , \quad \text{for } \pi \in \mathcal{P} .$$

Theorem 18 of Chapter 3 now implies that $\pi^*$ is optimal.

Q.E.D.


Example (The M/G/1 queueing system with removable server):

Let $\mathcal{P}'$ be the set of policies such that each policy in $\mathcal{P}'$ always turns the server on (or keeps him on) when the number of customers in the system is sufficiently large. It is easy to show that, for each $\pi \in \mathcal{P}'$, either $v_\pi^h(s)$ is finite-valued for each $s$ in $S$ or it is infinite-valued for each $s$ in $S$. In the latter case, all policies may be regarded as optimal. Therefore, we now focus on the former case where $v_\pi^h(s)$ is always finite-valued.

Clearly, $\pi_0$ in the theorem may be any policy in $\mathcal{P}'$. As before,

$$\rho(v_{\pi_0}^{nh},\ v_{\pi_0}^{nh} \wedge v_\pi^{nh}) < \infty , \quad \text{for } \pi \in \mathcal{P} .$$

Therefore, we conclude that if $\pi^*$ is an unimprovable policy in $\mathcal{P}'$, then $\pi^*$ is also optimal.